
andymeneely / chromium-history

Scripts and data related to Chromium's history

Ruby 76.13% JavaScript 0.86% CSS 0.13% HTML 0.98% Shell 2.81% Python 4.23% Makefile 0.09% C++ 2.07% R 12.70%

chromium-history's People

Contributors

alrodrig1, andymeneely, cketant, dani5447, kaylaerdmann, kbaumzie, nm6061, nuthanmunaiah, sidicarus, smt9020, sso7159, tesseradecades, toroidal-code


chromium-history's Issues

Create a basic rake test task

Create a few Test::Unit examples (we can talk about different testing frameworks later) that, given some loaded data, test what's in the database. For example, suppose our test data has 3 code reviews; then one unit test would be to make sure we have parsed all of those correctly.
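
A minimal sketch of one such test (the class name and the count of 3 follow the example above; everything else is standard Test::Unit):

    require 'test/unit'

    class CodeReviewLoadTest < Test::Unit::TestCase
      # Assumes the loader has already run against our test data,
      # which contains exactly 3 code reviews.
      def test_all_code_reviews_parsed
        assert_equal 3, CodeReview.count
      end
    end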

Convert Git loader to batch imports

Some more investigation needs to be done on this, but speeding up this loader is critical. Once we've completed #45, we can start doing batch imports of the data all at once without having to do any lookups (except for things like Developers)
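
As a rough sketch of what batch importing could look like (the activerecord-import gem is one option, not something settled in this issue; each_parsed_commit is a hypothetical iterator):

    commits = []
    each_parsed_commit do |fields|   # hypothetical iterator over the parsed git log
      commits << Commit.new(fields)
      if commits.size >= 1000
        Commit.import(commits)       # one multi-row INSERT via activerecord-import
        commits.clear
      end
    end
    Commit.import(commits) unless commits.empty?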

Ask Chromium dev team for their data

Be sure to mention:

  • Just their code reviews, not the bugs
  • Don't need the original patch sets
  • Is there a time that would be better? Or a better delay?

Better to have the benchmark done by the time we start (so issue #2 will need to be done first).

Create a basic rake parse

This is a set of tasks that will go out to our local data source and load any data we have collected.

For this task, create a basic schema for a Code Review, with a name and a Developer owner. Load a few pieces of test data.

As we go, we'll have more data sources to parse from, so design it with that in mind. Delegate these tasks out to parsing classes and then hook them into the task.
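
A sketch of the task layout this is asking for (the loader class name and data path are placeholders):

    # lib/tasks/parse.rake
    task :parse => :environment do
      # Each data source gets its own parsing class; this task just hooks them in.
      CodeReviewLoader.new.load('test/data/codereviews')
      # GitLogLoader.new.load(...) and others get added as sources grow
    end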

Label a developer as OWNER per-file and per-date

This is the second half of splitting #28.

Once #40 is done, improve upon dev.owner? by adding a new method to Developer that is dev.owner?(file,date) which checks if a developer was an owner of a given file at the given time.

This primarily means improving your data collection script for OWNERS so that it interprets their regular expressions. Maybe just store the expression and evaluate it when dev.owner?(file, date) is called? Or maybe somehow pre-compute it and put it into the database.
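
For the store-the-expression option, a rough sketch (the ownerships relation and its columns are assumptions):

    class Developer < ActiveRecord::Base
      has_many :ownerships   # hypothetical: directory, file_regex, start_date, end_date

      def owner?(file, date)
        active = ownerships.where('start_date <= ? AND (end_date IS NULL OR end_date >= ?)',
                                  date, date)
        active.any? do |o|
          file.start_with?(o.directory) &&
            (o.file_regex.nil? || file =~ Regexp.new(o.file_regex))
        end
      end
    end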

What are the measurable components of an effective code inspection?

We've been investigating and questioning for a while, so now it's time to examine some ways we can measure an "effective" code review. Let's document all of the pieces of evidence of an effective code review that we can identify.

For example:

  • All the reviewers participate
  • Messages or comments have more than, say, 50 characters
  • Messages or comments are less about style and more about opinion
  • Reviewers are OWNERs (or longtime OWNERs)

Write a loader for our Git log file

The loader should minimally pick up the following attributes from chromium-gitlog.txt:
For Commit Model:

  • Commit hash
  • Parent commit hash
  • Author email
  • Message
  • Code review(s) (this might be hard to parse in some situations - for example, when they quote another code review; pull the ID)
  • Filepaths (by way of the CommitFiles model)
  • BUG= field (should be an int)
  • R= field (reviewers)
  • TEST field (should be a text field)
  • SVN revision (an int; pull the ID)

For CommitFiles Model:

  • Filepaths (text type)
  • Commit (relation back to Commit)

Don't bother with the churn data - we'll use our own scripts to collect churn if we need it (but we may not even need it)

This also means creating two models: Commit and CommitFiles.

Also: take ~5 commits from the chromium-gitlog.txt on our data repo and commit them to this repo as our test data under test/data/
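
A sketch of the two models and their association (the issue's "CommitFiles" becomes CommitFile under Rails naming conventions; column details elided):

    class Commit < ActiveRecord::Base
      # commit_hash, parent_commit_hash, author_email, message, bug, svn_revision, ...
      has_many :commit_files
    end

    class CommitFile < ActiveRecord::Base
      # filepath (text), commit_id
      belongs_to :commit
    end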

Figure out rake:optimize

Not sure how we'll build indexes outside of the models. We need to look into how to build the models without indexes and then index them afterward.
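
One way this could work, as a sketch: leave indexes out of the schema and add them in a separate task after loading (the table and column names here are examples):

    # lib/tasks/optimize.rake
    task :optimize => :environment do
      conn = ActiveRecord::Base.connection
      conn.add_index :commits, :commit_hash
      conn.add_index :commit_files, :commit_id
    end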

Put OWNERS into the build system

The end-goal is to have two methods in Developer that look like this:

dev.is_owner? would tell us if this person has ever been an OWNER of any file
dev.is_owner?(date) would tell us if this person was ever an OWNER of any file as of that date.

To get there, we'll need a few Owner relations that keep track of the history.

We'll have to go into the Git history to get every copy of each OWNERs file, so perhaps parsing the Gitlog is a prerequisite (#25)
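A sketch of what those two methods might look like on top of an Owner-history relation (the ownerships name and start_date column are assumptions):

    class Developer < ActiveRecord::Base
      has_many :ownerships   # one row per (developer, OWNERS file, date range)

      def is_owner?(date = nil)
        return ownerships.exists? if date.nil?
        ownerships.where('start_date <= ?', date).exists?
      end
    end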

Rietveld and OWNERS

Is the OWNERs file directly tied into their Rietveld installation for LGTMing? Or is it just a convention?

Add CVE, IsInspectingVulnerability field to the system

A CodeReview can optionally have a CVE relation. Given a CSV file of code review issue IDs, parse through it. First update every code review to "false" for this field (maybe that's the default - not sure how postgres does it). Look up the code review by issue ID, and update the field to true. If that code review doesn't exist, that's a problem - flag it on the command line.

This means we need:

  • A loader
  • A migration to update the schema
  • A new CVE relation that can have many code reviews
  • Code review can have an optional CVE
  • Code review has a method is_inspecting_vulnerability? that checks if the relation has at least one record.
  • Some test data (I'll post it here momentarily)

Assume a CSV structure like the Vulnerabilities spreadsheet. But feel free to restructure it too (maybe a new code review id on each line?)
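
A sketch of the loader half (the CSV column names and Rails 4-style finders are assumptions):

    require 'csv'

    CodeReview.update_all(is_inspecting_vulnerability: false)   # reset the flag first
    CSV.foreach('cves.csv', headers: true) do |row|
      review = CodeReview.find_by(issue: row['code_review_id'])
      if review.nil?
        warn "No code review with issue #{row['code_review_id']}"   # flag it on the command line
      else
        review.cves << Cve.find_or_create_by(cve: row['cve'])
        review.update(is_inspecting_vulnerability: true)
      end
    end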

Any evidence of Sheriffs and Gardeners? Or other roles?

Are there any formal assignments to roles discussed in our glossary?

Any usage of that term? Or similar terms?

Are there any metrics we can use to identify people who hold that role? (e.g. constantly on WebKit --> Gardener)

Is any of this self-defined, or assigned by someone else?

Are tree closures communicated anywhere?

What are the subsystems of Chromium?

Can we use the directories as representatives of subsystems? What are the top-level ones we should generally ignore? /webkit? Can we trace every file to its subsystem using the folder structure, or do we need a more manual approach?

Convert all associations to their original keys

Every time we link two models together in a loader, it has to run a full table scan (without indexes) to find the record's primary key ID, then add that ID to the new insert. This is really slowing us down.

Let's just use our foreign keys that came with the data. For example, the PatchSet relationship to CodeReview should be via the issue number, not the PostgreSQL autogenerated primary key.

This task should involve some database migrations (maybe) and modifying the models' associations.

What this enables us to do is just load data in without worrying about keys.
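
For the PatchSet example, the association change might look like this (the foreign key column name is an assumption; primary_key/foreign_key are standard ActiveRecord options):

    class PatchSet < ActiveRecord::Base
      # Join on the Rietveld issue number carried in the data itself,
      # not on the PostgreSQL autogenerated id.
      belongs_to :code_review, primary_key: :issue, foreign_key: :code_review_issue
    end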

Build a basic rake analysis

With this set of tasks, we want to be able to hook in a question that runs a query on our data and provides the answer.

For this, design an example question, like "What is the average number of participating reviewers on a code review?"

When rake analysis is run, it should:

  • Assume the data is already parsed & indexed (i.e. it does not depend upon rake parse)
  • Print out the question
  • Print out the answer
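
A sketch of the example question as a task (the reviewers join table name is an assumption):

    # lib/tasks/analysis.rake
    task :analysis => :environment do
      question = 'What is the average number of participating reviewers on a code review?'
      answer = ActiveRecord::Base.connection.select_value(
        'SELECT AVG(n) FROM (SELECT COUNT(*) AS n FROM reviewers GROUP BY code_review_id) t')
      puts question
      puts answer
    end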

Who are the security experts in Chromium?

This is a manual investigation, but we can use our data to guide it. Are there explicit security experts, or is it generally known? Who generally handles the security fixes? Is there anyone who is always on a code inspection because they know more about security?

Code Review Loader not working

I'm not able to get rake run working. I already fixed a couple of bugs that were pushed in commits 25a8479 and 34e2b56. But I haven't been able to figure this one out. This is the full trace:

Loading code reviews: rake aborted!
Failed to read 17754 bytes from test/data/codereviews/10854242.json
lib/chromium_history/loaders/code_review_loader.rb:12:in `load_file'
lib/chromium_history/loaders/code_review_loader.rb:12:in `block in load'
lib/chromium_history/loaders/code_review_loader.rb:11:in `each'
lib/chromium_history/loaders/code_review_loader.rb:11:in `load'
lib/tasks/run.rake:30:in `block (4 levels) in <top (required)>'

Create rake run:verify tests that are specific to a particular environment

Design this to be flexible enough so we're not putting if statements everywhere. Maybe something like development/ and test/ folders. Tests here are not foreign-key-esque integrity issues, but hardcoded things like "We have 995 code reviews".

Actually, start with that: our development data should have 4 code reviews (currently), and our test data should have 995 code reviews.
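
A sketch of one such environment-specific check (the file layout follows the folder idea above and is not decided):

    # verify/development/code_review_count_verify.rb
    expected = 4   # the test/ copy of this file would expect 995
    actual = CodeReview.count
    raise "Expected #{expected} code reviews, found #{actual}" unless actual == expected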

Work with @macrobug on this.

Create a test data set

We can use our 1000 random code review records in our test data set. In our data repository on nitron, it's now under test/. To get a list of the IDs, iterate over "test/random_uniq_review_ids.txt"
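
Reading those IDs is a one-liner, e.g.:

    ids = File.readlines('test/random_uniq_review_ids.txt').map(&:strip)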

Improved test data set

We need a better test data set. Let's take a random sample of 1000 commits this time, figure out their code reviews, then put that into our test data. That way all of our data still links together properly and our foreign-key verify tasks still run properly.

Let's also make sure that that data set includes vulnerability data.

Build a robust JSON scraper for main data collection

We need to evolve our scraper into a robust script for collecting our data.

The scraper should:

  • Run from the command line with documented parameters using Trollop
  • Get both the code review and the patchsets
  • Get both messages and comments
  • Provide some sort of status or logging via piping to a log file so we aren't relying on stdout
  • Do one request at a time, with a configurable delay time
  • Be able to pick up where it left off by just re-running the script. This means logging what's done and then checking that log to determine where to start
  • We need a separate cronjob that checks every half hour whether this script is still running, and emails/texts me if it has gone down.

Question:
Should we store in msgpack, or just plain json? If we need to compress, I'd rather just gzip the file than use msgpack (I've been having trouble with msgpack)
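
A skeleton of the shape this could take (the Rietveld API URL, input file, and output layout are assumptions; Trollop handles the documented parameters):

    require 'trollop'
    require 'net/http'

    opts = Trollop::options do
      opt :delay, 'Seconds to sleep between requests', default: 0.5
      opt :log,   'Progress log, also used to resume', default: 'scrape.log'
    end

    done = File.exist?(opts[:log]) ? File.readlines(opts[:log]).map(&:strip) : []
    File.foreach('review_ids.txt') do |line|
      id = line.strip
      next if done.include?(id)   # pick up where we left off
      json = Net::HTTP.get(URI("https://codereview.chromium.org/api/#{id}?messages=true"))
      File.write("codereviews/#{id}.json", json)
      # patch set JSON would be fetched the same way, per the list above
      File.open(opts[:log], 'a') { |f| f.puts id }
      sleep opts[:delay]
    end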

I would like @toroidal-code to lead this, working with @dani5447.

Incorporate developers into the database

Currently, developers are not really handled the way I'd like them to be handled in the database. Throughout the JSON, developers are identified by their email address, and then the name is also provided. I'd like to reduce that redundancy in our database by having one Developer relation, and then everything else relating to it.

This means that, as we're parsing, we'll need to be populating or updating the Developer table. Here's the logic I want to use:

  • If we are parsing JSON and we come across an unknown email address, then that results in a new entry in Developer. If no name is available, then the name is just blank.
  • If we come across a known email address (i.e. Developer.find gives us one), and the name is blank, then update it with any names we have. For example, the CC field only has emails, but the Owner field has both name and email.
  • If we come across a known email address, with a non-blank name, check the two names. If they're different - flag it. Maybe on the command line, or pipe it to an "irregularities.txt" or something. We'll have to figure out what's going on there.

Thus, identify developers by emails, not names. But check the names for inconsistencies just in case.
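
That logic, as a sketch (the helper name is hypothetical):

    def find_or_update_developer(email, name = nil)
      dev = Developer.find_by(email: email)
      return Developer.create(email: email, name: name.to_s) if dev.nil?
      if dev.name.blank?
        dev.update(name: name) if name
      elsif name && dev.name != name
        # flag the inconsistency instead of silently overwriting
        File.open('irregularities.txt', 'a') { |f| f.puts "#{email}: #{dev.name} / #{name}" }
      end
      dev
    end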

The following relations will be associated with Developer

  • CodeReview's CC list
  • CodeReview's reviewer list
  • CodeReview's owner
  • Message's sender
  • Message's recipients
  • PatchSet's owner
  • Comment's author
  • OWNERs files

Develop scrapers and parsers for the NVD data

Revise and double-check our methods for obtaining the traceability of each CVE to its inspection. Revise the scrapers, and update the GoogleDoc. Mark any questionable ones that we need to circle back to; we'll make a new issue for those.

Data collection and parsing for dev.owner? method in Developer model

This is a smaller task from the #28 epic.

Let's first just focus on making a dev.owner? method which returns true if that developer is an OWNER on any file, ever.

This means our data collection script is quite simplistic in its parsing through OWNERS - just look for emails.

But, this does still mean we need to traverse all of the git log and look at every version of every OWNERS file.

Collect this data and save it to a file of your own format. Then we'll parse it and verify it as a part of our build process.
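
A rough sketch of the traversal, shelling out to git (the email regex is deliberately naive, and the output format here is just one option):

    owner_emails = Hash.new { |h, k| h[k] = [] }
    `git log --all --format=%H -- '*OWNERS'`.split.each do |sha|
      `git show --name-only --format= #{sha}`.split("\n").each do |path|
        next unless File.basename(path) == 'OWNERS'
        `git show #{sha}:#{path}`.scan(/\S+@\S+/) { |email| owner_emails[email] << sha }
      end
    end
    File.open('owners_data.txt', 'w') do |f|
      owner_emails.each { |email, shas| f.puts "#{email}\t#{shas.join(',')}" }
    end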

Work with @toroidal-code on this.

Git log truncates long filepaths - re-collect

Currently our git collection command is this:

git log --pretty=format:":::%n%H%n%an%n%ae%n%ad%n%P%n%s%n%b" --stat --ignore-space-change

But the --stat option appears to truncate long filepaths. We'll need to come up with a better pretty printer so we can collect that data properly.
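
One candidate fix (an option, not something settled in this issue): --numstat reports per-file churn with full, untruncated paths:

    git log --pretty=format:":::%n%H%n%an%n%ae%n%ad%n%P%n%s%n%b" --numstat --ignore-space-change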

Put the git log command in the comments in the git_log_loader.

Be sure to write a verify for this task.

When this is done, re-collect the appropriate data files:

  • Development data
  • Test data
  • Real data

Create a representative test data set

Get some actual code reviews with the following:

  • LGTMs with approval flags
  • A not LGTM with a disapproval flag
  • Multiple comments, multiple messages
  • Many different reviewers
  • Multiple patchsets

Put the test set in our data repo on nitron.

OWNERS files

Examine the OWNERS files

  • How much do they change? (e.g. how many commits per year?)
  • How often does one person's name get removed from an OWNERS file?
  • How many of the AUTHORS are in the sum total of OWNERS, and vice versa?
  • How often is one OWNER listed in multiple OWNERS files?

For this you'll need to:

  • Clone the git repo at git.chromium.org

Benchmark JSON scraping

Get reasonable estimates for the following:

  • If we had a 0.5-second delay between each request, how long would it take to get each review JSON and its associated patch sets?
  • How much space would all this take up?
  • How much would compression buy us? All compressed into one archive? Different archives?

Based on our research questions, what data do we need?

A few questions we need to evaluate against our research questions:

  • Do we really need to parse the full text of comments and messages? Or should we be collecting something within messages?
  • Do we really need to parse the TEST field in the git log?
  • Do we really need to parse the R= field in the git log?

Any other big fields we don't need? I'm thinking about trimming things down for performance here.

Associate Developers with CodeReviews and Commits

We've lost the connection between developers and our models. For example, we don't have an association that allows us to do:

CodeReview.take.reviewers

For this task, we need to:

  • Create new tables as needed for many-many relationships. Minimally, we need reviewers and cc.
  • Establish associations via these new tables. For example, we need a reviewers table that has just a code review and a developer field that would link to the appropriate tables. We don't need a Reviewer model, but the CodeReview has_many association needs to be :through the reviewers

To debug ActiveRecord associations, I strongly suggest using rails console
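
Since the issue rules out a full Reviewer model, one way to wire this up is has_and_belongs_to_many over the bare join tables (a sketch; with has_many :through you would instead keep a thin join model):

    class CodeReview < ActiveRecord::Base
      # 'reviewers' join table holds just code_review_id and developer_id
      has_and_belongs_to_many :reviewers, class_name: 'Developer', join_table: 'reviewers'
      has_and_belongs_to_many :cc,        class_name: 'Developer', join_table: 'ccs'
    end

With that in place, CodeReview.take.reviewers returns Developer records directly.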

rake clean

  • Drops all tables in the schema
  • Builds the tables in the schema
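
A sketch of what that pair of steps could look like (assumes a checked-in db/schema.rb to rebuild from):

    task :clean => :environment do
      conn = ActiveRecord::Base.connection
      conn.tables.each { |table| conn.drop_table(table) }   # drop everything
      load 'db/schema.rb'                                   # rebuild from the schema
    end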
