brianwarner / facade

See who is actually doing the work in your projects

Home Page: http://facade-oss.org

License: Apache License 2.0

PHP 41.54% Shell 1.39% CSS 1.05% Python 56.02%

facade's People

Contributors

brianwarner, lukaszgryglicki, mkdolan, ryjones

facade's Issues

Monitor a cgit page for newly added repos

Currently repos must be manually managed. When a project adds a new repo, somebody has to notice and take action to add the repo to facade.

It would be useful to have a feature that monitors a cgit page and detects changes. When a change is detected, it should be flagged (at minimum) on the project's detail page.

I don't think we want to add new repos automatically, in case the intent is to just monitor a subset.

HTML_unescape deprecated in Python 3.9

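When executing facade-worker.py, I was getting an error message that html.unescape had been deprecated in Python 3.9. (For context: the unescape() method of HTMLParser objects was removed in Python 3.9, while the module-level html.unescape() function is still supported.) This was easily solved by adding 'from html import unescape' and changing 'html.unescape' to 'unescape' in the facade-worker.py file.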
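
A minimal sketch of the import style described in this fix. `html.unescape` is in the standard library (Python 3.4 and later); the sample string is made up for illustration and is not taken from facade-worker.py.

```python
# Import the function directly, so the code does not depend on what the
# name `html` happens to be bound to elsewhere in the script.
from html import unescape

# Hypothetical input: a commit subject containing HTML entities.
raw_subject = "Fix &lt;b&gt; handling &amp; tests"
clean_subject = unescape(raw_subject)
print(clean_subject)
```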

Analysis fails if git repo has trailing slash

When cloning a repo with a trailing slash, Facade can't find the .git directory. This is because cloning a repo like this:

git clone git://domain.com/git/reponame

puts the repo into a directory called reponame, but doing the same like this:

git clone git://domain.com/git/reponame/

puts the repo into reponame/reponame.

This can be fixed by trimming the trailing slash.
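
One way the trimming could look, as a rough sketch. The function names are hypothetical and are not part of Facade; the idea is simply to normalize the URL once, before it is stored or used to work out the clone directory.

```python
import posixpath


def normalize_repo_url(url: str) -> str:
    """Strip surrounding whitespace and any trailing slashes from a git URL."""
    return url.strip().rstrip("/")


def repo_dir_name(url: str) -> str:
    """Return the directory name a clone of this URL is expected to use."""
    name = posixpath.basename(normalize_repo_url(url))
    # Remove only an explicit ".git" suffix, if present.
    return name[:-4] if name.endswith(".git") else name


print(normalize_repo_url("git://domain.com/git/reponame/"))
print(repo_dir_name("git://domain.com/git/reponame/"))
```

With this in place, both forms of the URL shown above resolve to the same directory name.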

facade-worker.py is not aware of gitdm EmailMap files with emails, only domains

At the end of a run, facade-worker.py checks to see if gitdm's map files have changed recently. If they have, it figures out what is different, and attempts to correct any historical (Unknown) entries in the database.

At this point it only understands domain -> employer mappings, not email -> employer. This should really be fixed to account for developers who don't use their employer's email.
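
The lookup order this implies could be sketched as follows: try the full email address first, then fall back to the domain. The function and the two dictionaries are hypothetical stand-ins for whatever facade-worker.py builds from the gitdm map files.

```python
def resolve_employer(email, email_map, domain_map, default="(Unknown)"):
    """Prefer an exact email match, then fall back to the email's domain."""
    email = email.strip().lower()
    if email in email_map:
        return email_map[email]
    # Everything after the last "@" is treated as the domain.
    domain = email.rpartition("@")[2]
    return domain_map.get(domain, default)


# Hypothetical mappings, for illustration only.
email_map = {"dev@personal.example": "Example Corp"}
domain_map = {"example.com": "Example Corp", "other.example": "Other Inc"}

print(resolve_employer("dev@personal.example", email_map, domain_map))
print(resolve_employer("someone@other.example", email_map, domain_map))
print(resolve_employer("nobody@nowhere.example", email_map, domain_map))
```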

Bulk import of projects and repos

Some projects have massive lists of git repos, e.g., OpenStack and xorg. Putting these in manually would be time consuming. It would be really helpful to have a standard way to bulk import projects and repos.

Reduce database transactions during analysis

The analysis portion of a facade-worker.py run is very database intensive, for a number of reasons. When designing the analysis functions, I wanted to be able to log neurotically, and stashed data as soon as it was computed so that if/when facade-worker.py failed, very little data would need to be recalculated. (FWIW recovery from an unplanned exit isn't an issue, as facade-worker.py just sees where it needs to pick up and resumes from there)

However, when a commit has a lot of files or there are a lot of commits in an analysis, the repeated database access can really slow things down. The mysqldb module is supposed to be the fastest connector, but at a certain point the sheer volume is the issue.

One potential solution would be to accumulate analyzed data in a temporary in-memory database, and then write everything out to storage in one big transaction at the end of a repo analysis. This should have minimal impact on short runs, but potentially a much larger impact on long runs.

One other massive advantage of reducing database transactions is that it could give us the option to use the pure-Python MySQL library, pymysql. In my tests pymysql is considerably slower than mysqldb for an individual transaction, but pymysql is necessary if we want to use PyPy. In past testing PyPy runs were slower, which is counterintuitive. The best explanation I can think of is that PyPy was hamstrung by the number of database transactions. There will be a push/pull performance tension here, but there's a pretty good chance that if we optimize database transactions, the gains from PyPy will make up for pymysql's latency.
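
A rough sketch of the accumulate-then-flush idea. It uses the standard library's sqlite3 module purely as a stand-in so the example is self-contained. The column names are hypothetical. With mysqldb or pymysql the placeholder would be %s rather than ?, but the shape (collect rows in memory, then one executemany inside one transaction) would be the same.

```python
import sqlite3


def store_repo_results(conn, rows):
    """Write all accumulated rows for one repo in a single transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany(
            "INSERT INTO analysis_data "
            "(repo_id, commit_hash, filename, added, removed) "
            "VALUES (?, ?, ?, ?, ?)",
            rows,
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE analysis_data ("
    "repo_id INTEGER, commit_hash TEXT, filename TEXT, "
    "added INTEGER, removed INTEGER)"
)

# Accumulate per-file results in memory while the repo is analyzed...
pending = []
for i in range(1000):
    pending.append((1, f"hash{i:04d}", "src/file.py", i, 0))

# ...then flush once at the end of the repo analysis.
store_repo_results(conn, pending)
row_count = conn.execute("SELECT COUNT(*) FROM analysis_data").fetchone()[0]
print(row_count)
```

The trade-off is the one noted above: if the worker dies mid-repo, that repo's accumulated rows are lost and have to be recomputed.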

Divide analysis_data table, create new commits table

Per discussion with @sgoggins, this issue proposes some major structural changes to the way commit data is stored. It was triggered by discussion around #33

For a while I have wanted to optimize analysis_data. Each row in the table contains info on each file that changed in a commit. Each row also contains its own copy of author and committer info. When a commit changes a single file, it's not really a big deal. But when a commit changes a lot of files, there's a lot of duplication in the metadata.

There is some benefit in breaking this info out into a separate table, called commits. It would reduce the overall size of analysis_data (I haven't run into issues with this yet, but I'm not using it at the same scale as Sean, see #31 ). It would also yield a graceful solution for #33 by providing us the ability to start over, storing dates as a native DATETIME rather than in ISO 8601 format as a VARCHAR.

In addition, it also gives us a new central place to store the commit message, which may be useful info.

The main changes required are:

  • Alter setup.py to move these columns out of analysis_data and into a new commits table
  • Add a clause to the function update_db in facade-worker.py to add the new commits table, copy over commit and author/committer info, remove old columns from analysis_data and optimize it, and then do a cursory walk through the git log of each repo to get full datetime info for authors/committers plus commit messages.
  • Update the caching functions with the new join between analysis_data and commits
  • Add the ability to view commit messages to various UIs
  • Cut a new major release, because this is a significant database change

While this is a big change, in theory it should be possible to do all of the changes transparently to a user with an existing database. The first facade-worker.py run after pulling this code will take longer than usual, but that's likely the only impact.
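
To make the proposed split concrete, here is a small sketch using sqlite3 as a stand-in for MySQL. Only the table names analysis_data and commits come from this issue. Every column name is an assumption for illustration, and the date columns would be DATETIME in MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE commits (
    id INTEGER PRIMARY KEY,
    repo_id INTEGER NOT NULL,
    commit_hash TEXT NOT NULL,
    author_email TEXT,
    author_date TEXT,       -- would be DATETIME in MySQL
    committer_email TEXT,
    committer_date TEXT,    -- would be DATETIME in MySQL
    message TEXT,
    UNIQUE (repo_id, commit_hash)
);
CREATE TABLE analysis_data (
    commit_id INTEGER NOT NULL REFERENCES commits(id),
    filename TEXT NOT NULL,
    added INTEGER,
    removed INTEGER
);
""")

# Commit metadata is stored once...
cur = conn.execute(
    "INSERT INTO commits (repo_id, commit_hash, author_email, author_date, "
    "committer_email, committer_date, message) VALUES (?, ?, ?, ?, ?, ?, ?)",
    (1, "abc123", "dev@example.com", "2017-03-01 12:00:00",
     "dev@example.com", "2017-03-01 12:05:00", "Refactor parser"),
)
commit_id = cur.lastrowid

# ...and each changed file only references it.
conn.executemany(
    "INSERT INTO analysis_data (commit_id, filename, added, removed) "
    "VALUES (?, ?, ?, ?)",
    [(commit_id, "a.py", 10, 2), (commit_id, "b.py", 3, 0), (commit_id, "c.py", 0, 7)],
)

# The kind of join the caching functions would need after the split.
summary = conn.execute(
    "SELECT c.author_email, SUM(a.added), SUM(a.removed), COUNT(*) "
    "FROM analysis_data a JOIN commits c ON a.commit_id = c.id "
    "GROUP BY c.author_email"
).fetchone()
print(summary)
```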

Feature: Enhancement to landing page UI

Facade instances with a large number of projects will have a very lengthy landing page. It would be nice to have an alternative way to list the projects on the home page as a table, with 4 or 5 projects per line, cubby-style.

Fix the ability to import repos in bulk from GitHub

At one point Facade could get a list of repos directly for a given GitHub user or project, and import them in bulk. This either broke, or never worked properly in the first place. The current code (which is commented out in projects.php and manage.php) seems to only pull the first page of repos.

The changes will be in manage.php, and are probably just fixing the way the GitHub API is called.
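
If the problem really is that only the first page is fetched, the usual remedy is to follow the pagination links. GitHub's REST API advertises the next page in the HTTP Link response header with rel="next". The sketch below is in Python rather than PHP, so it illustrates the logic and is not a patch for manage.php. The helper name and the example URLs are made up.

```python
import re


def next_page_url(link_header):
    """Return the rel="next" URL from an HTTP Link header, or None."""
    if not link_header:
        return None
    # Simplification: assumes the URLs themselves contain no commas.
    for part in link_header.split(","):
        match = re.match(r'\s*<([^>]+)>\s*;\s*rel="([^"]+)"', part)
        if match and match.group(2) == "next":
            return match.group(1)
    return None


# Hypothetical header value, for illustration only.
header = (
    '<https://api.example.com/repos?page=2>; rel="next", '
    '<https://api.example.com/repos?page=5>; rel="last"'
)
print(next_page_url(header))
```

A caller would keep requesting the returned URL and collecting repos until the function returns None.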

nginx support?

Any guidance on using nginx for the web server instead of apache?

Summary data should be cached to reduce page load times.

Pages with summary tables are really slow to load if you have a lot of data, or are trying to run facade on a wimpy old machine under your desk.

It's just as accurate (and infinitely more efficient) to cache this data at the end of facade-worker.py.

This will require:

  • Adding a cache table to the database
  • Adding the summary data calculations in facade-worker.py
  • Removing the summary data calculations from includes/display.php
  • Adding support to includes/display.php to pull from the cache instead.

Feature: Improve user authentication

Authentication is quite basic at this point, mainly intended to allow you to show someone else the results without them changing the configurations or accidentally deleting important things. Facade simply uses PHP sessions to determine if someone is logged in or not. There's no way to stay logged in, for example. Also, there may be much more secure ways to do this that I'm not aware of.

I would welcome input from someone smarter than me on how to improve the robustness of user authentication.

Feature: Ability to add a git repo as part of project info

For most projects, we need to track a single repo. It would be a nice enhancement to be able to enter a repo while creating a new project. This would save a few clicks (i.e., going into the add-repo window and doing it there, which would still be the case when tracking more than one repo).

Feature: author email equivalence table for grouping commits from multiple addresses

I would like to request a feature that would allow a Facade administrator to list equivalent email addresses for a person's primary email address. Here are use cases:

  1. Our git repo has some errors in it, where people mis-configured their git client and added commits with invalid email addresses. I would like to associate any invalid address(es) with the valid address, so the domain report doesn't show "unknown affiliation."
  2. Our git repo has commits from people who used both an external and internal email address in their commits. Again I would like to list the equivalences so "[email protected]" and "[email protected]" are grouped together for purposes of showing statistics.

Thanks in advance for considering it.

Store author/committer date as DATETIME instead of VARCHAR

Per @sgoggins' request, this feature would alter the tables to store the committer and author dates in a DATETIME column instead of VARCHAR. The context of the original decision is that the date info comes in as text from git log, and there wasn't any reason to do it otherwise. So, basically, laziness on my part...

Sean now has a reason, so this issue will track the downstream consequences of storing this commit data as a DATETIME. At minimum, it will require:

  • Changing the column definitions for data and cache tables in setup.py
  • Adding a clause to the function update_db() in facade-worker.py so that existing instances are properly updated
  • Making sure everything displays properly in the web and cli views
  • A new major release.

There are a few ways to do this. One is to alter the analysis_data table in-place, and force a recreation of the data. Another way is to split this data out separately (this appears preferable, as it will come with other benefits, and is tracked in a different issue).
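
Whichever route is taken, the text dates have to be converted at some point. A minimal sketch, assuming the stored strings are strict ISO 8601 with a UTC offset (the form git produces with the %aI and %cI format placeholders) and assuming we choose to normalize to UTC, since a MySQL DATETIME carries no time zone. The function name is hypothetical.

```python
from datetime import datetime, timezone


def to_utc_datetime(iso_text):
    """Parse an ISO 8601 timestamp with an offset into a naive UTC datetime."""
    # Older Python versions do not accept a trailing "Z", so rewrite it.
    if iso_text.endswith("Z"):
        iso_text = iso_text[:-1] + "+00:00"
    parsed = datetime.fromisoformat(iso_text)
    # Note: an input without any offset would be interpreted as local time.
    return parsed.astimezone(timezone.utc).replace(tzinfo=None)


stored = to_utc_datetime("2017-03-01T12:34:56-08:00")
formatted = stored.strftime("%Y-%m-%d %H:%M:%S")
print(formatted)
```

Normalizing to UTC loses the author's original offset. If that matters for reporting, it would need its own column.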

Merging Łukasz' fork

@lukaszgryglicki,

I had a look back to where Facade was when you forked it. Here are a few hints on what's changed since then.

  1. setup.py is much more automated. It can now create database, users, and tables automatically, and it generates the database connection files automatically. It can accept user-entered values for the database name, user, and password, or it randomly generates them.

  2. You can now use a shared database for affiliation and alias information. If you have a few instances of Facade, it is a lot easier when you don't have to keep the affiliation and alias info in sync between them. This also required some new logic in facade-worker.py to detect changes to affiliation and alias data made by other instances, and allowed me to remove it from the web frontend stuff. But, it also means that when you import gitdm files or make a change directly in the database, you no longer have to nuke your affiliations.

  3. Some of the tables have changed. analysis_data now holds the raw author and committer email, as it was in the patch. These are in author_raw_email and committer_raw_email. Also, author_email and committer_email now hold the aliased email. This is so if you delete an alias mapping, we can roll the author/committer email back to what it was in the patch, rather than holding onto it.

  4. Some columns in the database tables have changed. Your best bet is to export your current configs, set up a new instance of facade, and re-import the configs. This process should work despite the column changes, as I've tried to keep it from breaking. However, if it does break, ping me.

  5. By default, git repos clone into a subdirectory of facade now, instead of into /opt. This seemed to make a lot more sense, reduces the risk of collisions and makes it easy to clean up after a short-lived analysis.

I think that's it.


One request I have when you are merging: please organize new files into subdirectories where it makes sense. I'm trying to keep the top level as clean as possible for things related to display, and utilities/ for things related to analysis, as these are two different things. This also makes setting .htaccess rules a little easier.

Also, before I release I should update the coding style to PEP 8 (spaces instead of tabs, even though it kills me to do so). However, I will wait to do this until you've done your merge so we don't end up with more manual merging.

Feature: Support for git branches

Right now Facade just analyzes whatever branch is checked out. It would be nice to be able to check out other branches directly from the UI.

When this happens, Facade will detect any commits which are in the new branch, or are missing, and do the right thing. The main thing that's needed is to detect the remote branches, and a way to check out the new selection the next time facade-worker.py runs.

Feature: Add indices to analysis_data

Łukasz noted that the lack of indices on analysis_data was making it very slow. Early testing indicates orders of magnitude better performance when filling affiliations.

Feature: Add % to summary statistics

It would be helpful to have a % view of summary statistics by time period.

However, it should be presented in a way which doesn't commingle the % with the raw numbers, so it doesn't confuse Excel when copying/pasting results directly from the website. Possibly a set of radio buttons that allow you to select the view: "raw numbers", "%", or "raw numbers (%)"

Feature: report comment statistics by author as reviewer

Please extend Facade to include descriptive statistics about the comments posted by a person in Gerrit. Some people are very active as reviewers but less so as committers, and the stats today focus on commits. Comments are valuable, and showing stats about comments would highlight the value of reviewers in an open-source community.

Feature: Generate contributor graphs

For the sake of copying, pasting, and less time spent in Excel, it would be nice to have good-looking graphs of contributors generated directly in Facade. Depending upon how resource intensive they are, it could make sense either to pre-generate them at the end of a facade-worker.py run or, if they can be produced quickly for large data sets, to generate them on the fly.

Feature: Kiosk mode

Right now, Facade is mostly intended to be run internally, behind a firewall. The authentication is pretty basic (see issue #16), so this should certainly be addressed before putting this out on the open web.

An alternative, though, would be to have a read-only kiosk mode. The simplest way to do this (I think) would be to cache the pages as boring old HTML files, which could then be periodically transferred to a public server. Every time facade-worker.py runs, it would overwrite the old cached files. This has the advantage of being completely non-interactive and should also be quite fast. But I'm open to ideas.

Password generation bug

If the password ends with an "!", the PHP parser fails to interpret it properly, and the techy user spends 3 hours debugging in order to find the issue. #12YearsCatholicSchool #GuiltPro ... It's actually a very obtuse error condition, so, needless to say, I am impressed with myself for finding it.

The fix is to enclose the password in quotation marks.

and then I loaded 300 projects / 6,815 repos into facade and tried building cache 🤣

128 GB of RAM, solid state drives ... got all the repos and analysis_data ... it's been 21 hours loading project_weekly_cache ... no CPU usage, so I am guessing I have the database eating disk ...

I've already made a set of modifications and database config notes on my fork at sgoggins/facade:augur branch ... I'm thinking I can rewrite the query that loads the cache to go after one repository or project group at a time. Since this is a nominal 4 hour thing for me (a very experienced database guy / formerly well compensated Oracle DBA), I thought I would circle back and see if you would approve a pull request that modularized some of the functions in facade-worker.py into another Python file, or how you would recommend doing this.

The refactoring would change how cache is built and have options for execution. I think:

  1. Cache would not be rebuilt at the project level when "recache" is tagged automatically
  2. Cache at the project level would be rebuilt one project at a time
  3. I will explore a process of accumulating project level details from repository level cache, which may require some changes to the repository cache.
  4. Cache would be rebuilt at the repo level one repo at a time.
  5. I would take a parameter that enables wholesale cache building at the project and repo level, as is the case today, for smaller scale implementations
  6. I will explore the potential to keep cache without destroying it on each recache

What do you think @brianwarner ?

Report on activity in github, gerrit and other community systems

The user interactions in the comment/review process on github or gerrit are a huge part of the open-source community but git log (thus facade) does not reflect any of that activity. I suppose there are multiple parts here, here's a SWAG:

  1. analyze minimum viable data items to capture essence of github and gerrit reviews (other systems?) generically including merge/pull requests, change sets, votes, comment stats (comment count, comment size, ..) and more. In other words, hopefully there will be a data item "change_count" and not "github_pull_request_count" plus "gerrit_change_set_count".
  2. define database schema that can accommodate the data with least possible duplication
  3. define interfaces in Facade for storing/extracting/using these new data items
  4. build plug-ins that use Facade interfaces, that query github, gerrit, (other systems?) to pull the review activity data via JSON or whatever and push to facade
  5. build plug-in (hopefully singular, not specific to gerrit/github) to analyze and publish the review activity data

This Apache-licensed project for github might have usable code, last commit 2018 https://github.com/Netflix/osstracker

This MIT-licensed project for gerrit might have usable code, also last commit 2018 https://github.com/holmari/gerritstats

facade-worker.py - no SQL escaping

If there is a ' or other special character in a company name (I've imported rather big mappings from my other project, and some companies have ' in their names), then filling affiliations fails with an SQL error.
This is due to the use of code like this:

insert = ("INSERT INTO some_table (col1, col2) VALUES ('%s', '%s')" % (col1, col2))
cursor.execute(insert)

While python / SQL best practice suggests:

insert = ("INSERT INTO some_table (col1, col2) VALUES (%s, %s)")
cursor.execute(insert, (col1, col2))

Please also note that if there is a single item to escape, you must use a 1-tuple:
cursor.execute(insert, (col, ))
This will fail:
cursor.execute(insert, col)

I'm fixing this in my docker fork here: https://github.com/lukaszgryglicki/facade
The purpose of my fork is to create a docker container with facade preinstalled, along with our @cncf mappings.

So I will create a PR fixing this in a few hours!

I cannot assign myself, but please assign me if you have rights to do so.

... not speaking about SQL injections ... :P
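
The difference described in this issue can be reproduced in a self-contained way with the standard library's sqlite3 module, used here only as a stand-in. Note that sqlite3 uses ? as its placeholder, whereas the MySQL connectors discussed above use %s. The table, columns and company name are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE affiliations (domain TEXT, company TEXT)")

domain, company = "example.com", "Miller's Widgets"

# String formatting produces broken SQL as soon as a value contains a quote.
unsafe_sql = (
    "INSERT INTO affiliations (domain, company) VALUES ('%s', '%s')"
    % (domain, company)
)
try:
    conn.execute(unsafe_sql)
    unsafe_failed = False
except sqlite3.OperationalError:
    unsafe_failed = True

# Parameter binding lets the driver handle quoting.
conn.execute(
    "INSERT INTO affiliations (domain, company) VALUES (?, ?)",
    (domain, company),
)

# A single parameter still has to be passed as a 1-tuple.
stored = conn.execute(
    "SELECT company FROM affiliations WHERE domain = ?", (domain,)
).fetchone()[0]
print(unsafe_failed, stored)
```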

Repository adding seems obtuse or broken

I created my second project and am following the instructions trying to add repositories to that project, but not seeing a way to do it. I can "upload" a file on one screen, but what structure should the file be in?

./install_deps.sh Fails on Ubuntu 16.04.2

Just attempted an install and got the following. Will try and hunt down the dependencies manually in the meantime:

root@redmonk-labs:~/facade/utilities# ./install_deps.sh
Reading package lists... Done
Building dependency tree
Reading state information... Done
Note, selecting 'php7.0-xml' instead of 'php-dom'
Package python-bcrypt is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source

E: Package 'python-bcrypt' has no installation candidate
E: Unable to locate package python-xlsxwriter
