andymeneely / chromium-history
Scripts and data related to Chromium's history
Create a few Test::Unit examples (we can talk about different testing frameworks later) that, given some loaded data, test what's in the database. For example, suppose our test data has 3 code reviews; one unit test would make sure we have parsed all of them correctly.
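A minimal sketch of the kind of count-based test described above, written with Minitest (which ships with Ruby; swapping in Test::Unit is mechanical). The data structure and names here are stand-ins, not the real loader output.

```ruby
require "minitest/autorun"

# Hypothetical stand-in for loader output; the real suite would populate
# this from the database after loading test/data/.
PARSED_CODE_REVIEWS = [
  { issue: 1, owner: "alice" },
  { issue: 2, owner: "bob" },
  { issue: 3, owner: "carol" }
]

class CodeReviewLoadTest < Minitest::Test
  # Suppose our test data has 3 code reviews: verify all were parsed.
  def test_all_code_reviews_loaded
    assert_equal 3, PARSED_CODE_REVIEWS.size
  end

  # Every parsed review should have an owner attached.
  def test_every_review_has_an_owner
    PARSED_CODE_REVIEWS.each { |cr| refute_nil cr[:owner] }
  end
end
```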
I'll take care of this. We need to decide on how we'll use these conventions.
Can we identify if the discussion was about security based on how they fixed vulnerabilities? Are there any particular words they used?
Some more investigation needs to be done on this, but speeding up this loader is critical. Once we've completed #45, we can start doing batch imports of the data all at once without having to do any lookups (except for things like Developers)
Be sure to mention:
Better to have the benchmark done by the time we start (so issue #2 will need to be done first).
Anything that is used in a relationship should have an index. This is a pretty trivial task.
This is a set of tasks that will go out to our local data source and load any data we have collected.
For this task, create a basic schema for a Code Review, with a name and a Developer owner. Load a few pieces of test data.
As we go, we'll have more data sources to parse from, so design it with that mind. Delegate these tasks out to parsing classes and then hook them into the task.
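One possible shape for the basic schema described above, as an ActiveRecord migration plus model. Table and column names are illustrative assumptions, not the final schema.

```ruby
# Hypothetical migration: a code review with a name and an owning Developer.
class CreateCodeReviews < ActiveRecord::Migration
  def change
    create_table :code_reviews do |t|
      t.string  :name
      t.integer :developer_id  # the owning Developer
      t.timestamps
    end
  end
end

class CodeReview < ActiveRecord::Base
  # "owner" is an assumed association name for the Developer who owns this review
  belongs_to :owner, class_name: "Developer", foreign_key: "developer_id"
end
```

This is a schema sketch only; wiring the parsing classes into the rake task stays separate, per the delegation note above.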
When I run the git log parser, I get some filepaths not parsing properly. The above commit has this as a filepath: Roll ANGLE to r686. Addresses screen flickering on resize and some regressions on webpages.
This is the second half of splitting #28.
Once #40 is done, improve upon `dev.owner?` by adding a new method to Developer, `dev.owner?(file, date)`, which checks if a developer was an owner of a given file at the given time.
This primarily means improving your data collection script for OWNERS so that it interprets their regular expressions. Maybe just store the expression and evaluate it upon calling `dev.owner?(file, date)`? Or maybe somehow pre-compute it and put it into the database.
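A plain-Ruby sketch of the "store the expression, evaluate on call" option. The record shape, sample data, and validity-window fields are all made up for illustration; the real version would read from the Owner relation.

```ruby
require "date"

# Each record: who, which path pattern from the OWNERS file, and the
# date range during which that OWNERS entry was in effect (nil = still open).
OwnerRecord = Struct.new(:developer, :pattern, :valid_from, :valid_until)

OWNER_HISTORY = [
  OwnerRecord.new("alice", %r{\Asrc/net/.*}, Date.new(2011, 1, 1), Date.new(2012, 6, 1)),
  OwnerRecord.new("alice", %r{\Asrc/.*},     Date.new(2012, 6, 1), nil)
]

# owner?(dev, file, date): was dev an owner of file at that time?
def owner?(developer, file, date, history = OWNER_HISTORY)
  history.any? do |rec|
    rec.developer == developer &&
      rec.valid_from <= date &&
      (rec.valid_until.nil? || date < rec.valid_until) &&
      rec.pattern.match?(file)
  end
end

owner?("alice", "src/net/socket.cc", Date.new(2011, 5, 1))  # => true
owner?("alice", "src/net/socket.cc", Date.new(2010, 5, 1))  # => false
```

The pre-compute alternative would expand each pattern against the known file list up front and store (developer, file, date range) rows instead.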
We've been investigating and questioning for a while, so now it's time to examine some ways we can measure an "effective" code review. Let's document all of the pieces of evidence we can identify for an effective code review.
For example:
We need some algorithm implementations, such as:
Visualization for graphs too? Maybe we need a separate tool, maybe output to GraphViz. Maybe:
The loader should minimally pick up the following attributes from chromium-gitlog.txt:
For Commit Model:
For CommitFiles Model:
Don't bother with the churn data - we'll use our own scripts to collect churn if we need it (but we may not even need it)
This also means creating two models: Commit and CommitFiles.
Also: take ~5 commits from the chromium-gitlog.txt on our data repo and commit it to this repo as our test data under test/data/
Not sure how we'll build indexes outside of the models. Need to look into how we'll build models without indexes, but then index them.
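A plain-Ruby sketch of pulling the commit attributes out of chromium-gitlog.txt. It assumes the `:::`-delimited record format from our git log command (hash, author name, author email, date, parents, subject, body, in that order) and ignores the `--stat` lines; the sample record is fabricated.

```ruby
# Parse ":::"-delimited git log records into commit attribute hashes.
# Field order follows the pretty format ":::%n%H%n%an%n%ae%n%ad%n%P%n%s%n%b".
def parse_git_log(text)
  text.split(":::\n").reject(&:empty?).map do |record|
    hash, name, email, date, parents, subject, *body = record.lines.map(&:chomp)
    {
      commit_hash:        hash,
      author_name:        name,
      author_email:       email,
      created_at:         date,
      parent_commit_hash: parents,
      message:            [subject, *body].join("\n")
    }
  end
end

sample = ":::\nabc123\nAlice\nalice@example\nMon Jan 1\n\nFix crash\n"
commits = parse_git_log(sample)
commits.first[:commit_hash]  # => "abc123"
```

The CommitFiles rows would come from the per-file lines between records; that part depends on fixing the truncated-filepath issue noted elsewhere.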
The end-goal is to have two methods in Developer that look like this:
`dev.is_owner?` would tell us if this person was ever an OWNER in any file.
`dev.is_owner?(date)` would tell us if this person was ever an OWNER in any file as of that date.
To get there, we'll need a few Owner relations that keep track of the history.
We'll have to go into the Git history to get every copy of each OWNERs file, so perhaps parsing the Gitlog is a prerequisite (#25)
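The two methods above can be sketched in plain Ruby over an in-memory stand-in for the Owner relation; the row shape and sample data are illustrative only.

```ruby
require "date"

# Stand-in for Owner relation rows: (developer email, date the OWNERS
# entry was observed). Fabricated data for illustration.
OWNER_ROWS = [
  { email: "alice", as_of: Date.new(2012, 3, 1) },
  { email: "bob",   as_of: Date.new(2013, 9, 1) }
]

class Developer
  attr_reader :email

  def initialize(email)
    @email = email
  end

  # dev.is_owner?        -> ever an OWNER in any file
  # dev.is_owner?(date)  -> an OWNER in any file as of that date
  def is_owner?(date = nil, rows = OWNER_ROWS)
    rows.any? do |row|
      row[:email] == email && (date.nil? || row[:as_of] <= date)
    end
  end
end

Developer.new("alice").is_owner?                        # => true
Developer.new("bob").is_owner?(Date.new(2012, 6, 1))    # => false
```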
Is the OWNERs file directly tied into their Rietveld installation for LGTMing? Or is it just a convention?
Come up with a list of ways we can use the OWNERs file to measure effective code reviews.
A CodeReview can optionally have a CVE relation. Given a CSV file of code review issue IDs, parse through it. First update every code review to "false" for this field (maybe that's the default - not sure how postgres does it). Look up the code review by issue ID, and update the field to true. If that code review doesn't exist, that's a problem - flag it on the command line.
This means we need:
Assume a CSV structure like the Vulnerabilities spreadsheet. But feel free to restructure it too (maybe a new code review ID on each line?)
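The flagging pass above can be sketched with Ruby's CSV library and an in-memory stand-in for the code_reviews table. The `issue_id` column name and the second issue number are assumptions; defaults are set to false first, then flipped to true per CSV row, with missing reviews flagged on the command line.

```ruby
require "csv"

# In-memory stand-in for the code_reviews table, keyed by issue ID.
code_reviews = {
  10854242 => { cve: false },
  10855555 => { cve: false }
}

# First pass: default every code review's cve flag to false.
code_reviews.each_value { |cr| cr[:cve] = false }

# Hypothetical CSV: one issue ID per line; 99999999 deliberately unknown.
csv_text = "issue_id\n10854242\n99999999\n"

CSV.parse(csv_text, headers: true) do |row|
  id = row["issue_id"].to_i
  if (cr = code_reviews[id])
    cr[:cve] = true
  else
    # A CSV row without a matching code review is a problem: flag it.
    warn "No code review found for issue #{id}"
  end
end
```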
Update our GoogleDoc so we have all of the latest non-embargoed CVE entries for Chromium. Trace them to the code inspections of the fix commits.
Are there any formal assignments to roles discussed in our glossary?
Any usage of that term? Or similar terms?
Are there any metrics we can use to identify people who hold that role? (e.g. constantly on WebKit --> Gardener)
Is any of this self-defined, or assigned by someone else?
Are tree closures communicated anywhere?
Can we use the directories as representatives of subsystems? What are the top-level ones we should generally ignore? /webkit? Can we trace every file to its subsystem using the folder structure, or do we need a more manual approach?
Every time we link two models together in a loader, it has to run a full table scan (without indexes) to find the record's primary key ID, then add that ID to the new insert. This is really slowing us down.
Let's just use the foreign keys that came with the data. For example, the PatchSet relationship to CodeReview should be via the issue number, not the PostgreSQL autogenerated primary key.
This task should involve some database migrations (maybe) and modifying the models' associations.
What this enables us to do is just load data in without worrying about keys.
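One way the PatchSet-to-CodeReview example could look with ActiveRecord's `primary_key`/`foreign_key` association options; column names here are assumptions, and the `issue` column would need the index discussed elsewhere.

```ruby
# Associate through the Rietveld issue number that came with the data,
# not the Postgres surrogate id, so loaders never need key lookups.
class CodeReview < ActiveRecord::Base
  has_many :patch_sets, primary_key: :issue, foreign_key: :issue
end

class PatchSet < ActiveRecord::Base
  belongs_to :code_review, primary_key: :issue, foreign_key: :issue
end
```

With this, a loader can insert PatchSet rows carrying the issue number straight from the JSON, with no table scan to resolve a CodeReview id.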
With this set of tasks, we want to be able to hook in a question that runs a query on our data and provides the answer.
For this, design an example question, like "What is the average number of participating reviewers on a code review?"
When `rake analysis` is run, it should:
`rake parse`
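The example question ("average number of participating reviewers") can be answered with a one-line aggregate; here is a plain-Ruby stand-in over fabricated data, where the real version would be an ActiveRecord query inside the rake analysis task.

```ruby
# Fabricated stand-in for code reviews and their participating reviewers.
reviews = [
  { issue: 1, reviewers: ["alice", "bob"] },
  { issue: 2, reviewers: ["carol"] },
  { issue: 3, reviewers: ["alice", "bob", "carol"] }
]

# Average number of participating reviewers per code review.
average = reviews.sum { |r| r[:reviewers].size }.to_f / reviews.size
puts format("Average participating reviewers: %.2f", average)  # prints 2.00
```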
This is a manual investigation, but we can use our data to guide it. Are there explicit security experts, or is it generally known? Who generally handles the security fixes? Is there anyone who is always on a code inspection because they know more about security?
I'm not able to get `rake run` working. I already fixed a couple of bugs that were pushed in commits 25a8479 and 34e2b56, but I haven't been able to figure this one out. This is the full trace:

Loading code reviews: rake aborted! Failed to read 17754 bytes from test/data/codereviews/10854242.json
lib/chromium_history/loaders/code_review_loader.rb:12:in `load_file'
lib/chromium_history/loaders/code_review_loader.rb:12:in `block in load'
lib/chromium_history/loaders/code_review_loader.rb:11:in `each'
lib/chromium_history/loaders/code_review_loader.rb:11:in `load'
lib/tasks/run.rake:30:in `block (4 levels) in <top (required)>'
Write a verify test for #32. Check the `cve?` method in the test data. Test both the positive and the negative result.
The order should be:
Use fake or test data for now.
Let @toroidal-code help you out with this.
Design this to be flexible enough that we're not putting if statements everywhere. Maybe something like development/ and test/ folders. Tests here are not foreign-key-esque integrity issues, but hardcoded things like "We have 995 code reviews".
Actually, start with that: our development data should have 4 code reviews (currently), and our test data should have 995 code reviews.
Work with @macrobug on this.
We can use our 1000 random code review records in our test data set. In our data repository on nitron, it's now under test/. To get a list of the IDs, iterate over "test/random_uniq_review_ids.txt"
Our performance on the test data set is already pretty slow. What we really need is to do bulk inserts, which involves rewriting how we do loading.
I'm thinking that activerecord-import is the way to go, based on this question:
http://stackoverflow.com/questions/15317837/bulk-insert-records-into-active-record-table
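A sketch of what the loader's insert path could look like with the activerecord-import gem's `Model.import(columns, values)` form, per the linked question. The `Commit` model and `parsed_commits` array are assumptions standing in for real loader output.

```ruby
require "activerecord-import"

# parsed_commits: hypothetical array of attribute hashes from the parser.
columns = [:commit_hash, :author_email]
values  = parsed_commits.map { |c| [c[:commit_hash], c[:author_email]] }

# One multi-row INSERT instead of one INSERT per record; skipping
# validations trades safety for speed, which is the point here.
Commit.import(columns, values, validate: false)
```

This composes well with the natural-foreign-key work: rows carry their own keys, so nothing needs a lookup before import.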
We need a better test data set. Let's take a random sample of 1000 commits this time, figure out their code reviews, then put that into our test data. That way all of our data still links together properly and our foreign key verifies still run properly.
Let's also make sure that that data set includes vulnerability data.
We need to evolve our scraper into a robust script for collecting our data.
The scraper should:
Question:
Should we store in msgpack, or just plain json? If we need to compress, I'd rather just gzip the file than use msgpack (I've been having trouble with msgpack)
I would like @toroidal-code to lead this, working with @dani5447.
Currently, developers are not really handled the way I'd like them to be handled in the database. Throughout the JSON, developers are identified by their email address, and then the name is also provided. I'd like to reduce that redundancy in our database by having one Developer relation, and then everything else relating to it.
This means that, as we're parsing, we'll need to be populating or updating the Developer table. Here's the logic I want to use:
Thus, identify developers by emails, not names. But check the names for inconsistencies just in case.
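A plain-Ruby sketch of that logic, with a Hash standing in for the Developer table (the real version would use an ActiveRecord find-or-create by email): key on email, and warn when the incoming name disagrees with the stored one.

```ruby
# Identify developers by email; names are checked only for consistency.
# `developers` is a Hash standing in for the Developer table.
def record_developer(developers, email:, name:)
  if (dev = developers[email])
    # Same email, different name: keep the first name, flag the mismatch.
    warn "Name mismatch for #{email}: #{dev[:name].inspect} vs #{name.inspect}" if dev[:name] != name
    dev
  else
    developers[email] = { email: email, name: name }
  end
end

devs = {}
record_developer(devs, email: "a@example", name: "Alice")
record_developer(devs, email: "a@example", name: "Alice A.")  # warns, no new row
```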
The following relations will be associated with Developer
Revise and double-check our methods for obtaining the traceability of each CVE to its inspection. Revise the scrapers, and update the GoogleDoc. Mark any questionable ones that we need to circle back to and we'll make a new issue for it.
This is a smaller task from the #28 epic.
Let's first just focus on making a `dev.owner?` method which returns true if that developer is an OWNER on any file, ever.
This means our data collection script is quite simplistic in its parsing through OWNERS - just look for emails.
But, this does still mean we need to traverse all of the git log and look at every version of every OWNERS file.
Collect this data and save it to a file of your own format. Then we'll parse it and verify it as a part of our build process.
Work with @toroidal-code on this.
Sometimes CC's have a [email protected] - which means that their real email address is just [email protected]. We need to disambiguate whenever we search by email.
This feature needs to be tested within rake:verify
Currently our git collection command is this:
git log --pretty=format:":::%n%H%n%an%n%ae%n%ad%n%P%n%s%n%b" --stat --ignore-space-change
But the `--stat` option appears to truncate long filepaths. We'll need to come up with a better pretty printer so we can collect that data properly.
Put the git log command in the comments in the git_log_loader.
Be sure to write a verify for this task.
When this is done, re-collect the appropriate data files:
Get some actual code reviews with the following:
Put the test set in our data repo on nitron.
Does it actually work?
Once it does, query for all the code review IDs we need. We need those IDs for the scraper.
Examine the OWNERs files
For this you'll need to:
Set this up to be a cron job that checks out our latest code (from master) and does a rake run. Status should be easy to check, and alerts should be easy to build.
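A minimal sketch of the cron setup, assuming a nightly schedule and illustrative paths; the status line at the end makes "easy to check / easy to alert on" a matter of grepping the log.

```shell
#!/bin/sh
# nightly.sh -- cron wrapper for the nightly rake run. Paths and schedule
# are assumptions. Suggested crontab entry (02:00 daily):
#   0 2 * * * /home/build/nightly.sh >> /home/build/nightly.log 2>&1
set -e
cd /home/build/chromium-history
git checkout master
git pull origin master
bundle install
if bundle exec rake run; then
  echo "rake run OK at $(date)"       # greppable status line
else
  echo "rake run FAILED at $(date)"   # greppable alert hook
fi
```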
Get reasonable estimates for the following:
Be sure to test:
This is the gripe sheet for our Rails setup woes.
A few questions we need to evaluate against our research questions:
Any other big fields we don't need? I'm thinking about trimming things down for performance here.
Title says it all.
We've lost the connection between developers and our models. For example, we don't have an association that allows us to do:
CodeReview.take.reviewers
For this task, we need to bring back the `reviewers` and `cc` associations. That means a reviewers table that has just a code review field and a developer field linking to the appropriate tables. We don't need a full-blown Reviewer concept, but the CodeReview `has_many` association needs to be `:through` the reviewers table.
To debug ActiveRecord associations, I strongly suggest using `rails console`.
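One way the join could be wired up. Note that ActiveRecord's `has_many :through` does require a model class for the join table, so this sketch includes a minimal one; all names here are assumptions.

```ruby
# Minimal join model over the reviewers table: just the two keys.
class Reviewer < ActiveRecord::Base
  belongs_to :code_review
  belongs_to :developer
end

class CodeReview < ActiveRecord::Base
  has_many :reviewers                                             # join rows
  has_many :reviewing_developers, through: :reviewers, source: :developer
end

class Developer < ActiveRecord::Base
  has_many :reviewers
  has_many :reviewed_code_reviews, through: :reviewers, source: :code_review
end
```

With this in place, `CodeReview.take.reviewing_developers` works again; a parallel `cc` join table would follow the same pattern.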