
andymeneely / chromium-history

Scripts and data related to Chromium's history

Ruby 76.13% JavaScript 0.86% CSS 0.13% HTML 0.98% Shell 2.81% Python 4.23% Makefile 0.09% C++ 2.07% R 12.70%

chromium-history's People

Contributors

alrodrig1, andymeneely, cketant, dani5447, kaylaerdmann, kbaumzie, nm6061, nuthanmunaiah, sidicarus, smt9020, sso7159, tesseradecades, toroidal-code


chromium-history's Issues

Create a basic rake test task

Create a few Test::Unit examples (we can talk about different testing frameworks later) that, given some loaded data, test what's in the database. For example, suppose our test data has 3 code reviews; then one unit test would be to make sure we have parsed all of those correctly.
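
A minimal sketch of one such test (the class name and the count of 3 follow the example above; everything else is standard Test::Unit):

    require 'test/unit'

    class CodeReviewLoadTest < Test::Unit::TestCase
      # Assumes the loader has already run against our test data,
      # which contains exactly 3 code reviews.
      def test_all_code_reviews_parsed
        assert_equal 3, CodeReview.count
      end
    end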

Convert Git loader to batch imports

Some more investigation needs to be done on this, but speeding up this loader is critical. Once we've completed #45, we can start doing batch imports of the data all at once without having to do any lookups (except for things like Developers)
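
As a rough sketch of what batch importing could look like (the activerecord-import gem is one option, not something settled in this issue; each_parsed_commit is a hypothetical iterator):

    commits = []
    each_parsed_commit do |fields|   # hypothetical iterator over the parsed git log
      commits << Commit.new(fields)
      if commits.size >= 1000
        Commit.import(commits)       # one multi-row INSERT via activerecord-import
        commits.clear
      end
    end
    Commit.import(commits) unless commits.empty?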

Ask Chromium dev team for their data

Be sure to mention:

  • Just their code reviews, not the bugs
  • Don't need the original patch sets
  • Is there a time that would be better? Or a better delay?

Better to have the benchmark done by the time we start (so issue #2 will need to be done first).

Create a basic rake parse

This is a set of tasks that will go out to our local data source and load any data we have collected.

For this task, create a basic schema for a Code Review, with a name and a Developer owner. Load a few pieces of test data.

As we go, we'll have more data sources to parse from, so design it with that in mind. Delegate these tasks out to parsing classes and then hook them into the task.
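
A sketch of the task layout this is asking for (the loader class name and data path are placeholders):

    # lib/tasks/parse.rake
    task :parse => :environment do
      # Each data source gets its own parsing class; this task just hooks them in.
      CodeReviewLoader.new.load('test/data/codereviews')
      # GitLogLoader.new.load(...) and others get added as sources grow
    end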

Label a developer as OWNER per-file and per-date

This is the second half of splitting #28.

Once #40 is done, improve upon dev.owner? by adding a new method to Developer that is dev.owner?(file,date) which checks if a developer was an owner of a given file at the given time.

This primarily means improving your data collection script for OWNERS so that it interprets their regular expressions. Maybe just store the expression and evaluate it when dev.owner?(file, date) is called? Or maybe somehow pre-compute it and put it into the database.
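
For the store-the-expression option, a rough sketch (the ownerships relation and its columns are assumptions):

    class Developer < ActiveRecord::Base
      has_many :ownerships   # hypothetical: directory, file_regex, start_date, end_date

      def owner?(file, date)
        active = ownerships.where('start_date <= ? AND (end_date IS NULL OR end_date >= ?)',
                                  date, date)
        active.any? do |o|
          file.start_with?(o.directory) &&
            (o.file_regex.nil? || file =~ Regexp.new(o.file_regex))
        end
      end
    end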

What are the measurable components of an effective code inspection?

We've been investigating and questioning for a while, so now it's time to examine some ways we can measure an "effective" code review. Let's document all of the pieces of evidence of an effective code review that we can identify.

For example:

  • All the reviewers participate
  • Messages or comments have more than, say, 50 characters
  • Messages or comments are less about style and more about opinion
  • Reviewers are OWNERs (or longtime OWNERs)

Write a loader for our Git log file

The loader should minimally pick up the following attributes from chromium-gitlog.txt:
For Commit Model:

  • Commit hash
  • Parent commit hash
  • Author email
  • Message
  • Code review(s) (this might be hard to parse in some situations - for example, when they quote another code review; pull the ID)
  • Filepaths (by way of the CommitFiles model)
  • BUG= field (should be an int)
  • R= field (reviewers)
  • TEST field (should be a text field)
  • SVN revision (an int; pull the ID)

For CommitFiles Model:

  • Filepaths (text type)
  • Commit (relation back to Commit)

Don't bother with the churn data - we'll use our own scripts to collect churn if we need it (but we may not even need it)

This also means creating two models: Commit and CommitFiles.

Also: take ~5 commits from the chromium-gitlog.txt on our data repo and commit them to this repo as our test data under test/data/
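
A sketch of the two models and their association (the issue's "CommitFiles" becomes CommitFile under Rails naming conventions; column details elided):

    class Commit < ActiveRecord::Base
      # commit_hash, parent_commit_hash, author_email, message, bug, svn_revision, ...
      has_many :commit_files
    end

    class CommitFile < ActiveRecord::Base
      # filepath (text), commit_id
      belongs_to :commit
    end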

Figure out rake:optimize

Not sure how we'll build indexes outside of the models. We need to look into how to build the models without indexes and then index them afterward.
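
One way this could work, as a sketch: leave indexes out of the schema and add them in a separate task after loading (the table and column names here are examples):

    # lib/tasks/optimize.rake
    task :optimize => :environment do
      conn = ActiveRecord::Base.connection
      conn.add_index :commits, :commit_hash
      conn.add_index :commit_files, :commit_id
    end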

Put OWNERS into the build system

The end-goal is to have two methods in Developer that look like this:

dev.is_owner? would tell us if this person has ever been an OWNER of any file
dev.is_owner?(date) would tell us if this person was ever an OWNER of any file as of that date.

To get there, we'll need a few Owner relations that keep track of the history.

We'll have to go into the Git history to get every copy of each OWNERs file, so perhaps parsing the Gitlog is a prerequisite (#25)
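A sketch of what those two methods might look like on top of an Owner-history relation (the ownerships name and start_date column are assumptions):

    class Developer < ActiveRecord::Base
      has_many :ownerships   # one row per (developer, OWNERS file, date range)

      def is_owner?(date = nil)
        return ownerships.exists? if date.nil?
        ownerships.where('start_date <= ?', date).exists?
      end
    end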

Rietveld and OWNERS

Is the OWNERs file directly tied into their Rietveld installation for LGTMing? Or is it just a convention?

Add CVE, IsInspectingVulnerability field to the system

A CodeReview can optionally have a CVE relation. Given a CSV file of code review issue IDs, parse through it. First update every code review to "false" for this field (maybe that's the default - not sure how postgres does it). Look up the code review by issue ID, and update the field to true. If that code review doesn't exist, that's a problem - flag it on the command line.

This means we need:

  • A loader
  • A migration to update the schema
  • A new CVE relation that can have many code reviews
  • Code review can have an optional CVE
  • Code review has a method is_inspecting_vulnerability? that checks if the relation has at least one record.
  • Some test data (I'll post it here momentarily)

Assume a CSV structure like the Vulnerabilities spreadsheet. But feel free to restructure it too (maybe a new code review id on each line?)
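
A sketch of the loader half (the CSV column names and Rails 4-style finders are assumptions):

    require 'csv'

    CodeReview.update_all(is_inspecting_vulnerability: false)   # reset the flag first
    CSV.foreach('cves.csv', headers: true) do |row|
      review = CodeReview.find_by(issue: row['code_review_id'])
      if review.nil?
        warn "No code review with issue #{row['code_review_id']}"   # flag it on the command line
      else
        review.cves << Cve.find_or_create_by(cve: row['cve'])
        review.update(is_inspecting_vulnerability: true)
      end
    end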

Any evidence of Sheriffs and Gardeners? Or other roles?

Are there any formal assignments to roles discussed in our glossary?

Any usage of that term? Or similar terms?

Are there any metrics we can use to identify people who hold that role? (e.g. constantly on WebKit --> Gardener)

Is any of this self-defined, or assigned by someone else?

Are tree closures communicated anywhere?

What are the subsystems of Chromium?

Can we use the directories as representatives of subsystems? What are the top-level ones we should generally ignore? /webkit? Can we trace every file to its subsystem using the folder structure, or do we need a more manual approach?

Convert all associations to their original keys

Every time we link two models together in a loader, it has to run a full table scan (without indexes) to find the record's primary key ID, then add that ID to the new insert. This is really slowing us down.

Let's just use our foreign keys that came with the data. For example, the PatchSet relationship to CodeReview should be via the issue number, not the PostgreSQL autogenerated primary key.

This task should involve some database migrations (maybe) and modifying the models' associations.

What this enables us to do is just load data in without worrying about keys.
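
For the PatchSet example, the association change might look like this (the foreign key column name is an assumption; primary_key/foreign_key are standard ActiveRecord options):

    class PatchSet < ActiveRecord::Base
      # Join on the Rietveld issue number carried in the data itself,
      # not on the PostgreSQL autogenerated id.
      belongs_to :code_review, primary_key: :issue, foreign_key: :code_review_issue
    end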

Build a basic rake analysis

With this set of tasks, we want to be able to hook in a question that runs a query on our data and provides the answer.

For this, design an example question, like "What is the average number of participating reviewers on a code review?"

When rake analysis is run, it should:

  • Assume the data is already parsed & indexed (i.e. it does not depend upon rake parse)
  • Print out the question
  • Print out the answer
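
A sketch of the example question as a task (the reviewers join table name is an assumption):

    # lib/tasks/analysis.rake
    task :analysis => :environment do
      question = 'What is the average number of participating reviewers on a code review?'
      answer = ActiveRecord::Base.connection.select_value(
        'SELECT AVG(n) FROM (SELECT COUNT(*) AS n FROM reviewers GROUP BY code_review_id) t')
      puts question
      puts answer
    end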

Who are the security experts in Chromium?

This is a manual investigation, but we can use our data to guide it. Are there explicit security experts, or is it generally known? Who generally handles the security fixes? Is there anyone who is always on a code inspection because they know more about security?

Code Review Loader not working

I'm not able to get rake run working. I already fixed a couple of bugs that were pushed in commits 25a8479 and 34e2b56. But I haven't been able to figure this one out. This is the full trace:

Loading code reviews: rake aborted!
Failed to read 17754 bytes from test/data/codereviews/10854242.json
lib/chromium_history/loaders/code_review_loader.rb:12:in `load_file'
lib/chromium_history/loaders/code_review_loader.rb:12:in `block in load'
lib/chromium_history/loaders/code_review_loader.rb:11:in `each'
lib/chromium_history/loaders/code_review_loader.rb:11:in `load'
lib/tasks/run.rake:30:in `block (4 levels) in <top (required)>'

Create rake run:verify tests that are specific to a particular environment

Design this to be flexible enough so we're not putting if statements everywhere. Maybe something like development/ and test/ folders. Tests here are not foreign-key-esque integrity issues, but hardcoded things like "We have 995 code reviews".

Actually, start with that: our development data should have 4 code reviews (currently), and our test data should have 995 code reviews.
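
A sketch of one such environment-specific check (the file layout follows the folder idea above and is not decided):

    # verify/development/code_review_count_verify.rb
    expected = 4   # the test/ copy of this file would expect 995
    actual = CodeReview.count
    raise "Expected #{expected} code reviews, found #{actual}" unless actual == expected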

Work with @macrobug on this.

Create a test data set

We can use our 1000 random code review records in our test data set. In our data repository on nitron, it's now under test/. To get a list of the IDs, iterate over "test/random_uniq_review_ids.txt"
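
Reading those IDs is a one-liner, e.g.:

    ids = File.readlines('test/random_uniq_review_ids.txt').map(&:strip)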

Improved test data set

We need a better test data set. Let's take a random sample of 1000 commits this time, figure out their code reviews, then put that into our test data. That way all of our data still links together properly and our foreign-key verify tasks still run properly.

Let's also make sure that that data set includes vulnerability data.

Build a robust JSON scraper for main data collection

We need to evolve our scraper into a robust script for collecting our data.

The scraper should:

  • Run from the command line with documented parameters using Trollop
  • Get both the code review and the patchsets
  • Get both messages and comments
  • Provide some sort of status or logging via piping to a log file so we aren't relying on stdout
  • Do one request at a time, with a configurable delay time
  • Be able to pick up where it left off by just re-running the script. This means logging what's done and then checking that log to determine where to start
  • We need a separate cronjob that checks every half hour whether this script is still running, and emails/texts me if it has gone down.

Question:
Should we store in msgpack, or just plain json? If we need to compress, I'd rather just gzip the file than use msgpack (I've been having trouble with msgpack)
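
A skeleton of the shape this could take (the Rietveld API URL, input file, and output layout are assumptions; Trollop handles the documented parameters):

    require 'trollop'
    require 'net/http'

    opts = Trollop::options do
      opt :delay, 'Seconds to sleep between requests', default: 0.5
      opt :log,   'Progress log, also used to resume', default: 'scrape.log'
    end

    done = File.exist?(opts[:log]) ? File.readlines(opts[:log]).map(&:strip) : []
    File.foreach('review_ids.txt') do |line|
      id = line.strip
      next if done.include?(id)   # pick up where we left off
      json = Net::HTTP.get(URI("https://codereview.chromium.org/api/#{id}?messages=true"))
      File.write("codereviews/#{id}.json", json)
      # patch set JSON would be fetched the same way, per the list above
      File.open(opts[:log], 'a') { |f| f.puts id }
      sleep opts[:delay]
    end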

I would like @toroidal-code to lead this, working with @dani5447.

Incorporate developers into the database

Currently, developers are not really handled the way I'd like them to be handled in the database. Throughout the JSON, developers are identified by their email address, and then the name is also provided. I'd like to reduce that redundancy in our database by having one Developer relation, and then everything else relating to it.

This means that, as we're parsing, we'll need to be populating or updating the Developer table. Here's the logic I want to use:

  • If we are parsing JSON and we come across an unknown email address, then that results in a new entry in Developer. If no name is available, then the name is just blank.
  • If we come across a known email address (i.e. Developer.find gives us one), and the name is blank, then update it with any names we have. For example, the CC field only has emails, but the Owner field has both name and email.
  • If we come across a known email address, with a non-blank name, check the two names. If they're different - flag it. Maybe on the command line, or pipe it to an "irregularities.txt" or something. We'll have to figure out what's going on there.

Thus, identify developers by emails, not names. But check the names for inconsistencies just in case.
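
That logic, as a sketch (the helper name is hypothetical):

    def find_or_update_developer(email, name = nil)
      dev = Developer.find_by(email: email)
      return Developer.create(email: email, name: name.to_s) if dev.nil?
      if dev.name.blank?
        dev.update(name: name) if name
      elsif name && dev.name != name
        # flag the inconsistency instead of silently overwriting
        File.open('irregularities.txt', 'a') { |f| f.puts "#{email}: #{dev.name} / #{name}" }
      end
      dev
    end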

The following relations will be associated with Developer

  • CodeReview's CC list
  • CodeReview's reviewer list
  • CodeReview's owner
  • Message's sender
  • Message's recipients
  • PatchSet's owner
  • Comment's author
  • OWNERs files

Develop scrapers and parsers for the NVD data

Revise and double-check our methods for obtaining the traceability of each CVE to its inspection. Revise the scrapers, and update the GoogleDoc. Mark any questionable ones that we need to circle back to; we'll make a new issue for those.

Data collection and parsing for dev.owner? method in Developer model

This is a smaller task from the #28 epic.

Let's first just focus on making a dev.owner? method which returns true if that developer is an OWNER on any file, ever.

This means our data collection script is quite simplistic in its parsing through OWNERS - just look for emails.

But, this does still mean we need to traverse all of the git log and look at every version of every OWNERS file.

Collect this data and save it to a file of your own format. Then we'll parse it and verify it as a part of our build process.
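
A rough sketch of the traversal, shelling out to git (the email regex is deliberately naive, and the output format here is just one option):

    owner_emails = Hash.new { |h, k| h[k] = [] }
    `git log --all --format=%H -- '*OWNERS'`.split.each do |sha|
      `git show --name-only --format= #{sha}`.split("\n").each do |path|
        next unless File.basename(path) == 'OWNERS'
        `git show #{sha}:#{path}`.scan(/\S+@\S+/) { |email| owner_emails[email] << sha }
      end
    end
    File.open('owners_data.txt', 'w') do |f|
      owner_emails.each { |email, shas| f.puts "#{email}\t#{shas.join(',')}" }
    end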

Work with @toroidal-code on this.

Git log truncates long filepaths - re-collect

Currently our git collection command is this:

git log --pretty=format:":::%n%H%n%an%n%ae%n%ad%n%P%n%s%n%b" --stat --ignore-space-change

But the --stat option appears to truncate long filepaths. We'll need to come up with a better pretty printer so we can collect that data properly.
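
One candidate fix (an option, not something settled in this issue): --numstat reports per-file churn with full, untruncated paths:

    git log --pretty=format:":::%n%H%n%an%n%ae%n%ad%n%P%n%s%n%b" --numstat --ignore-space-change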

Put the git log command in the comments in the git_log_loader.

Be sure to write a verify for this task.

When this is done, re-collect the appropriate data files:

  • Development data
  • Test data
  • Real data

Create a representative test data set

Get some actual code reviews with the following:

  • LGTMs with approval flags
  • A not LGTM with a disapproval flag
  • Multiple comments, multiple messages
  • Many different reviewers
  • Multiple patchsets

Put the test set in our data repo on nitron.

OWNERS files

Examine the OWNERS files

  • How much do they change? (e.g. how many commits per year?)
  • How often does one person's name get removed from an OWNERS file?
  • How many of the AUTHORS are in the sum total of OWNERS, and vice versa?
  • How often is one OWNER listed in multiple OWNERS files?

For this you'll need to:

  • Clone the git repo at git.chromium.org

Benchmark JSON scraping

Get reasonable estimates for the following:

  • If we had a 0.5-second delay between each request, how long would it take to get each review JSON and its associated patch sets?
  • How much space would all this take up?
  • How much would compression buy us? All compressed into one archive? Different archives?

Based on our research questions, what data do we need?

A few questions we need to evaluate against our research questions:

  • Do we really need to parse the full text of comments and messages? Or should we be collecting something within messages?
  • Do we really need to parse the TEST field in the git log?
  • Do we really need to parse the R= field in the git log?

Any other big fields we don't need? I'm thinking about trimming things down for performance here.

Associate Developers with CodeReviews and Commits

We've lost the connection between developers and our models. For example, we don't have an association that allows us to do:

CodeReview.take.reviewers

For this task, we need to:

  • Create new tables as needed for many-many relationships. Minimally, we need reviewers and cc.
  • Establish associations via these new tables. For example, we need a reviewers table that has just a code review and a developer field that would link to the appropriate tables. We don't need a Reviewer model, but the CodeReview has_many association needs to be :through the reviewers

To debug ActiveRecord associations, I strongly suggest using rails console
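
Since the issue rules out a full Reviewer model, one way to wire this up is has_and_belongs_to_many over the bare join tables (a sketch; with has_many :through you would instead keep a thin join model):

    class CodeReview < ActiveRecord::Base
      # 'reviewers' join table holds just code_review_id and developer_id
      has_and_belongs_to_many :reviewers, class_name: 'Developer', join_table: 'reviewers'
      has_and_belongs_to_many :cc,        class_name: 'Developer', join_table: 'ccs'
    end

With that in place, CodeReview.take.reviewers returns Developer records directly.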

rake clean

  • Drops all tables in the schema
  • Builds the tables in the schema
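
A sketch of what that pair of steps could look like (assumes a checked-in db/schema.rb to rebuild from):

    task :clean => :environment do
      conn = ActiveRecord::Base.connection
      conn.tables.each { |table| conn.drop_table(table) }   # drop everything
      load 'db/schema.rb'                                   # rebuild from the schema
    end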
