src-d / go-license-detector

Reliable project licenses detector.

License: Other

Makefile 2.01% Go 97.99%
spdx-license spdx-licenses spdx license-scan license-management

go-license-detector's Introduction

WE CONTINUE THE DEVELOPMENT AT go-enry/go-license-detector. This repository is abandoned: no further updates will be made to the code base, and issues and PRs will not be answered or attended to.

go-license-detector

Project license detector - a command line application and a library, written in Go. It scans the given directory for license files, normalizes and hashes them, and outputs all the fuzzy matches against the list of reference texts. The returned names follow the SPDX standard. Read the blog post.

Why? There are no similar projects which can be compiled into a native binary without dependencies and also support the whole SPDX license database (≈400 items). This implementation is also fast, requires little memory, and the API is easy to use.

The license texts are taken directly from the license-list-data repository. The detection algorithm is not template matching; this directly implies that go-license-detector does not provide any legal guarantees. Its intended area of use is data mining.

Installation

export GO111MODULE=on
go mod download
go build -v gopkg.in/src-d/go-license-detector.v3/cmd/license-detector

Contributions

...are welcome, see CONTRIBUTING.md and code of conduct.

License

Apache 2.0, see LICENSE.md.

Algorithm

  1. Find files in the root directory which may represent a license. E.g. LICENSE or license.md.
  2. If the file is Markdown or reStructuredText, render to HTML and then convert to plain text. Original HTML files are also converted.
  3. Normalize the text according to SPDX recommendations.
  4. Split the text into unigrams and build the weighted bag of words.
  5. Calculate Weighted MinHash.
  6. Apply Locality Sensitive Hashing and pick the reference licenses which are close.
  7. For each candidate, calculate the Levenshtein distance D. The corresponding text is a single line with each unigram represented by a single rune (character).
  8. Set the similarity to 1 - D / L, where L is the number of unigrams in the queried license.
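
Steps 7 and 8 can be illustrated with a minimal sketch. It is not the library's code: the helper names (encode, levenshtein, similarity) are hypothetical, normalization is reduced to lowercasing and whitespace splitting, and the two sample sentences are made up.

```go
package main

import (
	"fmt"
	"strings"
)

// encode maps every distinct unigram to its own rune, so a text becomes a
// slice with one rune per word. The vocabulary is shared between texts.
func encode(text string, vocab map[string]rune) []rune {
	words := strings.Fields(strings.ToLower(text))
	out := make([]rune, 0, len(words))
	for _, w := range words {
		r, ok := vocab[w]
		if !ok {
			r = rune(len(vocab) + 1)
			vocab[w] = r
		}
		out = append(out, r)
	}
	return out
}

func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// levenshtein returns the edit distance between a and b using two DP rows.
func levenshtein(a, b []rune) int {
	prev := make([]int, len(b)+1)
	curr := make([]int, len(b)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(a); i++ {
		curr[0] = i
		for j := 1; j <= len(b); j++ {
			cost := 1
			if a[i-1] == b[j-1] {
				cost = 0
			}
			curr[j] = min3(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev, curr = curr, prev
	}
	return prev[len(b)]
}

// similarity computes 1 - D/L, where L is the unigram count of the query.
func similarity(query, reference string) float64 {
	vocab := map[string]rune{}
	q := encode(query, vocab)
	r := encode(reference, vocab)
	if len(q) == 0 {
		return 0
	}
	return 1 - float64(levenshtein(q, r))/float64(len(q))
}

func main() {
	ref := "permission is hereby granted free of charge to any person"
	query := "permission is hereby granted free of cost to any person"
	fmt.Printf("%.2f\n", similarity(query, ref)) // one substituted word out of ten: 0.90
}
```

Because the distance is counted in whole unigrams rather than characters, a one-word change costs the same regardless of word length. The value can drop below zero when the reference is much longer than the query.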

This pipeline guarantees constant-time queries, though it requires some initialization to preprocess the reference licenses.
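
The trade-off between a one-time preprocessing pass and cheap lookups can be sketched with a toy index. Note the substitution: this uses plain MinHash over sets of unigrams with banded Locality Sensitive Hashing, not the Weighted MinHash named in step 5. The Index type, the signature length and the band size are assumptions chosen only for illustration.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
	"strings"
)

const (
	numHashes = 32 // signature length (arbitrary for this sketch)
	bandSize  = 4  // rows per band, giving 8 bands
)

// signature computes a plain MinHash signature over the set of unigrams.
// Each of the numHashes hash functions is FNV-1a prefixed with its index.
func signature(text string) [numHashes]uint64 {
	var sig [numHashes]uint64
	for i := range sig {
		sig[i] = ^uint64(0)
	}
	var seed [8]byte
	for _, w := range strings.Fields(strings.ToLower(text)) {
		for i := range sig {
			binary.LittleEndian.PutUint64(seed[:], uint64(i))
			h := fnv.New64a()
			h.Write(seed[:])
			h.Write([]byte(w))
			if v := h.Sum64(); v < sig[i] {
				sig[i] = v
			}
		}
	}
	return sig
}

// bandKeys turns each band of the signature into one bucket key.
func bandKeys(sig [numHashes]uint64) []string {
	keys := make([]string, 0, numHashes/bandSize)
	for b := 0; b < numHashes; b += bandSize {
		keys = append(keys, fmt.Sprint(b, sig[b:b+bandSize]))
	}
	return keys
}

// Index maps bucket keys to the names of the reference texts stored there.
type Index map[string][]string

// Add preprocesses one reference text; this is the initialization cost.
func (idx Index) Add(name, text string) {
	for _, k := range bandKeys(signature(text)) {
		idx[k] = append(idx[k], name)
	}
}

// Candidates returns the references that share at least one band with text.
// The number of bucket lookups is fixed, independent of the index size.
func (idx Index) Candidates(text string) map[string]bool {
	out := map[string]bool{}
	for _, k := range bandKeys(signature(text)) {
		for _, name := range idx[k] {
			out[name] = true
		}
	}
	return out
}

func main() {
	idx := Index{}
	idx.Add("ref-a", "permission is hereby granted free of charge to any person obtaining a copy")
	idx.Add("ref-b", "redistribution and use in source and binary forms with or without modification")
	fmt.Println(idx.Candidates("permission is hereby granted free of charge to any person obtaining a copy"))
}
```

Only the references returned by Candidates would then go through the more expensive Levenshtein comparison.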

If no license files are found:

  1. Look for README files.
  2. If the file is Markdown or reStructuredText, render to HTML and then convert to plain text. Original HTML files are also converted.
  3. Scan for words like "copyright", "license" and "released under". Take the neighborhood.
  4. Run Named Entity Recognition (NER) over that surrounding context and extract the possible license name.
  5. Match it against the list of license names from SPDX.

Usage

Command line:

license-detector /path/to/project
license-detector https://github.com/src-d/go-git

Library (for a single license detection):

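package main

import (
	"log"

	"gopkg.in/src-d/go-license-detector.v3/licensedb"
	"gopkg.in/src-d/go-license-detector.v3/licensedb/filer"
)

func main() {
	licenses, err := licensedb.Detect(filer.FromDirectory("/path/to/project"))
	if err != nil {
		log.Fatalf("license detection failed: %v", err)
	}
	log.Println(licenses)
}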
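
A typical next step is to keep only the confident matches and order them. The sketch below does not call the library, because the concrete value type returned by Detect may differ between versions. It assumes the results have already been reduced to a map from SPDX identifier to confidence, the shape described in the "Return matching license files" issue further down. The match type, the rank helper and the scores are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// match pairs an SPDX identifier with its confidence (hypothetical type).
type match struct {
	License    string
	Confidence float32
}

// rank drops entries below threshold and sorts the rest by descending
// confidence, breaking ties by identifier so the order is deterministic.
func rank(scores map[string]float32, threshold float32) []match {
	var out []match
	for id, c := range scores {
		if c >= threshold {
			out = append(out, match{License: id, Confidence: c})
		}
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Confidence != out[j].Confidence {
			return out[i].Confidence > out[j].Confidence
		}
		return out[i].License < out[j].License
	})
	return out
}

func main() {
	// Made-up scores, for illustration only.
	scores := map[string]float32{"MIT": 0.98, "ISC": 0.91, "BSD-3-Clause": 0.42}
	for _, m := range rank(scores, 0.75) {
		fmt.Printf("%s %.2f\n", m.License, m.Confidence)
	}
}
```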

Library (for a convenient data structure that can be formatted as JSON):

package main

import (
	"encoding/json"
	"fmt"

	"gopkg.in/src-d/go-license-detector.v3/licensedb"
)

func main() {
	results := licensedb.Analyse("/path/to/project1", "/path/to/project2")
	bytes, err := json.MarshalIndent(results, "", "\t")
	if err != nil {
		fmt.Printf("could not encode result to JSON: %v\n", err)
		return // do not print an empty result after a failed encode
	}
	fmt.Println(string(bytes))
}

Quality

On a dataset of the ~1000 most starred repositories on GitHub as of early February 2018 (list), 99% of the licenses are detected. Detection failures are analyzed in FAILURES.md.

Comparison to other projects on that dataset:

| Detector                 | Detection rate | Time to scan, sec |
|--------------------------|----------------|-------------------|
| go-license-detector      | 99% (897/902)  | 13.5              |
| benbalter/licensee       | 75% (673/902)  | 111               |
| google/licenseclassifier | 76% (682/902)  | 907               |
| boyter/lc                | 88% (797/902)  | 548               |
| amzn/askalono            | 87% (785/902)  | 165               |
| LiD                      | 94% (847/902)  | 3660              |

How this was measured

$ cd $(go env GOPATH)/src/gopkg.in/src-d/go-license-detector.v3/licensedb
$ mkdir dataset && cd dataset
$ unzip ../dataset.zip
$ # src-d/go-license-detector
$ time license-detector * \
  | grep -Pzo '\n[-0-9a-zA-Z]+\n\tno license' | grep -Pa '\tno ' | wc -l
$ # benbalter/licensee
$ time ls -1 | xargs -n1 -P4 licensee \
  | grep -E "^License: Other" | wc -l
$ # google/licenseclassifier
$ time find -type f -print | xargs -n1 -P4 identify_license \
  | cut -d/ -f2 | sort | uniq | wc -l
$ # boyter/lc
$ time lc . \
  | grep -vE 'NOASSERTION|----|Directory' | cut -d" " -f1 | sort | uniq | wc -l
$ # amzn/askalono
$ echo '#!/bin/sh
result=$(askalono id "$1")
echo "$1
$result"' > ../askalono.wrapper
$ time find -type f -print | xargs -n1 -P4 sh ../askalono.wrapper | grep -Pzo '.*\nLicense: .*\n' askalono.txt | grep -av "License: " | cut -d/ -f 2 | sort | uniq | wc -l
$ # LiD
$ time license-identifier -I dataset -F csv -O lid
$ cat lid_*.csv | cut -d, -f1 | cut -d"'" -f 2 | grep / | cut -d/ -f2 | sort | uniq | wc -l

Regenerate binary data

The SPDX licenses are embedded in the binary. To update them, run

make bindata.go

go-license-detector's People

Contributors

adracus, aperezg, campoy, dsymonds, erizocosmico, johanbrandhorst, lafriks, marclop, mcuadros, sebbonnet, smola, tevino, vmarkovtsev, zurk

go-license-detector's Issues

gomodules: impossible to build when depend on gopkg.in/src-d/go-license-detector.v2

Hello src-d team,
@vmarkovtsev please help to solve this issue.
Right now I'm using gopkg.in/src-d/go-license-detector.v2 as my dependency.
It is currently not possible to build with Go modules because of an incorrect dependency in your go.mod source:

gopkg.in/russross/blackfriday.v2 v2.0.0

blackfriday does not host v2.0.0 anymore:
https://gopkg.in/russross/blackfriday.v2
Is it possible to update go-license-detector to use v2.0.1?

Issue with go mod

I ran go get -u gopkg.in/src-d/go-license-detector.v2 on go version go1.12.8 darwin/amd64 and got:

go: gopkg.in/russross/blackfriday.v2@v2.0.1: go.mod has non-....v2 module path "github.com/russross/blackfriday/v2" at revision v2.0.1

Make fails

The make fails on current master.

~/go-license-detector> make
curl -SLk -o license-list-data.tar.gz https://github.com/spdx/license-list-data/archive/v3.0.tar.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   128    0   128    0     0     79      0 --:--:--  0:00:01 --:--:--    79
100 13.6M  100 13.6M    0     0  2249k      0  0:00:06  0:00:06 --:--:-- 3724k
tar -xf license-list-data.tar.gz license-list-data-3.0/text
tar -xf license-list-data.tar.gz license-list-data-3.0/json/details
go run licensedb/internal/assets/extract_urls.go license-list-data-3.0/json/details > urls.csv
go run licensedb/internal/assets/extract_names.go license-list-data-3.0/json/details > names.csv
tar -cf licenses.tar -C license-list-data-3.0/text .
rm -rf license-list-data-3.0
2019/09/10 14:59:14 Listing license-list-data-3.0/json/details: open license-list-data-3.0/json/details: no such file or directory
exit status 1
2019/09/10 14:59:14 Listing license-list-data-3.0/json/details: open license-list-data-3.0/json/details: no such file or directory
exit status 1
make: *** [names.csv] Error 1
make: *** Waiting for unfinished jobs....
make: *** [urls.csv] Error 1

install failed go: error loading module requirements

Hi, I got the error below when installing:

export GO111MODULE=on
go mod download
go build -v gopkg.in/src-d/go-license-detector.v2/cmd/license-detector
warning: pattern "all" matched no module dependencies
Fetching https://gopkg.in/src-d/go-license-detector.v2/cmd/license-detector?go-get=1
Parsing meta tags from https://gopkg.in/src-d/go-license-detector.v2/cmd/license-detector?go-get=1 (status code 200)
get "gopkg.in/src-d/go-license-detector.v2/cmd/license-detector": found meta tag get.metaImport{Prefix:"gopkg.in/src-d/go-license-detector.v2", VCS:"git", RepoRoot:"https://gopkg.in/src-d/go-license-detector.v2"} at https://gopkg.in/src-d/go-license-detector.v2/cmd/license-detector?go-get=1
get "gopkg.in/src-d/go-license-detector.v2/cmd/license-detector": verifying non-authoritative meta tag
Fetching https://gopkg.in/src-d/go-license-detector.v2?go-get=1
Parsing meta tags from https://gopkg.in/src-d/go-license-detector.v2?go-get=1 (status code 200)
Fetching https://gopkg.in/src-d/go-license-detector.v2/cmd?go-get=1
Parsing meta tags from https://gopkg.in/src-d/go-license-detector.v2/cmd?go-get=1 (status code 200)
get "gopkg.in/src-d/go-license-detector.v2/cmd": found meta tag get.metaImport{Prefix:"gopkg.in/src-d/go-license-detector.v2", VCS:"git", RepoRoot:"https://gopkg.in/src-d/go-license-detector.v2"} at https://gopkg.in/src-d/go-license-detector.v2/cmd?go-get=1
get "gopkg.in/src-d/go-license-detector.v2/cmd": verifying non-authoritative meta tag
Fetching https://gopkg.in/src-d/go-license-detector.v2?go-get=1
Parsing meta tags from https://gopkg.in/src-d/go-license-detector.v2?go-get=1 (status code 200)
get "gopkg.in/src-d/go-license-detector.v2": found meta tag get.metaImport{Prefix:"gopkg.in/src-d/go-license-detector.v2", VCS:"git", RepoRoot:"https://gopkg.in/src-d/go-license-detector.v2"} at https://gopkg.in/src-d/go-license-detector.v2?go-get=1
Fetching https://gopkg.in/russross/blackfriday.v2?go-get=1
Parsing meta tags from https://gopkg.in/russross/blackfriday.v2?go-get=1 (status code 200)
get "gopkg.in/russross/blackfriday.v2": found meta tag get.metaImport{Prefix:"gopkg.in/russross/blackfriday.v2", VCS:"git", RepoRoot:"https://gopkg.in/russross/blackfriday.v2"} at https://gopkg.in/russross/blackfriday.v2?go-get=1
go: gopkg.in/russross/blackfriday.v2@v2.0.1: go.mod has non-....v2 module path "github.com/russross/blackfriday/v2" at revision v2.0.1
go: error loading module requirements

How can this be fixed? Thanks.

Error when build project

I hit a problem when pulling github.com/russross/blackfriday/v2 during the build. I tried both go get and a local build:

Go Get:

» go get github.com/src-d/go-license-detector/...
package github.com/src-d/go-license-detector/cmd/license-detector
        imports github.com/russross/blackfriday/v2: cannot find package "github.com/russross/blackfriday/v2" in any of:
        /Users/ledongthuc/.gvm/gos/go1.11.5/src/github.com/russross/blackfriday/v2 (from $GOROOT)
        /Users/ledongthuc/.gvm/pkgsets/go1.11.5/global/src/github.com/russross/blackfriday/v2 (from $GOPATH)

Go Build:

» cd $GOPATH/src/github.com/src-d/go-license-detector
src-d/go-license-detector [master] » go mod vendor
go: modules disabled inside GOPATH/src by GO111MODULE=auto; see 'go help modules'
src-d/go-license-detector [master] » go build -v gopkg.in/src-d/go-license-detector.v2/cmd/license-detector

../../../gopkg.in/src-d/go-license-detector.v2/licensedb/internal/processors/markup.go:7:2: cannot find package "github.com/russross/blackfriday/v2" in any of:
        /Users/ledongthuc/.gvm/gos/go1.11.5/src/github.com/russross/blackfriday/v2 (from $GOROOT)
        /Users/ledongthuc/.gvm/pkgsets/go1.11.5/global/src/github.com/russross/blackfriday/v2 (from $GOPATH)

Go Env:

GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/ledongthuc/Library/Caches/go-build"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/ledongthuc/.gvm/pkgsets/go1.12.5/global"
GOPROXY=""
GORACE=""
GOROOT="/Users/ledongthuc/.gvm/gos/go1.12.5"
GOTMPDIR=""
GOTOOLDIR="/Users/ledongthuc/.gvm/gos/go1.12.5/pkg/tool/darwin_amd64"
GCCGO="gccgo"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/p8/14ky0v4n385_8kcx4zccl5_00000gn/T/go-build626311752=/tmp/go-build -gno-record-gcc-switches -fno-common"

Recursive scan of local project

Perhaps I am doing something wrong, but GLD does not seem to recursively scan my vendor directory for licenses. Is that by design? I have a Godep setup to handle deps and I would like to ensure (without legal status) that the included deps are of a certain license.

I am now using a series of bash commands to extract the deps as seen by dep but it is a bit shaky.

Using the API causes build issues due to mismatched arguments

When invoking the go license detector as an API there are build errors that appear to be caused by the use of gopkg.in having incompatible versions bundled together. It is not clear what the compatible versions of referenced packages from the license detector library are.

Go 1.11
Not using go mod, using dep

Version : gopkg.in/src-d/go-license-detector.v2/licensedb

Code

import (
        "gopkg.in/src-d/go-license-detector.v2/licensedb"                                                                                            
        "gopkg.in/src-d/go-license-detector.v2/licensedb/filer"                                                                                      
)
...
                licenses, errGo := licensedb.Detect(filer.FromDirectory(dir))                                                                       
...

Build errors

vendor/gopkg.in/src-d/go-license-detector.v2/licensedb/filer/filer.go:212:19: assignment mismatch: 2 variables but 1 values
vendor/gopkg.in/src-d/go-license-detector.v2/licensedb/filer/filer.go:212:43: not enough arguments in call to filesystem.NewStorage
        have (sivafs.SivaFS)
        want (billy.Filesystem, cache.Object)

build issue

go get -v gopkg.in/src-d/go-license-detector.v2/...
gopkg.in/src-d/go-license-detector.v2/vendor/github.com/xanzy/ssh-agent
# gopkg.in/src-d/go-license-detector.v2/vendor/github.com/xanzy/ssh-agent
..\..\..\go\src\gopkg.in\src-d\go-license-detector.v2\vendor\github.com\xanzy\ssh-agent\pageant_windows.go:33:2: Pointer redeclared during import "unsafe"
        previous declaration during import "syscall"
..\..\..\go\src\gopkg.in\src-d\go-license-detector.v2\vendor\github.com\xanzy\ssh-agent\pageant_windows.go:33:2: imported and not used: "unsafe"
..\..\..\go\src\gopkg.in\src-d\go-license-detector.v2\vendor\github.com\xanzy\ssh-agent\pageant_windows.go:113:45: cannot convert ptr (type uintptr) to type syscall.Pointer
..\..\..\go\src\gopkg.in\src-d\go-license-detector.v2\vendor\github.com\xanzy\ssh-agent\pageant_windows.go:122:18: cannot convert &mapNameBytesZ[0] (type *byte) to type syscall.Pointer
..\..\..\go\src\gopkg.in\src-d\go-license-detector.v2\vendor\github.com\xanzy\ssh-agent\pageant_windows.go:125:68: cannot convert &cds (type *copyData) to type syscall.Pointer
..\..\..\go\src\gopkg.in\src-d\go-license-detector.v2\vendor\github.com\xanzy\ssh-agent\pageant_windows.go:144:42: cannot convert nameP (type *uint16) to type syscall.Pointer

Unable to install v3.1.0 in Mac

go build -v gopkg.in/src-d/go-license-detector.v3/cmd/license-detector
gopkg.in/src-d/go-git.v4/plumbing/transport/ssh

gopkg.in/src-d/go-git.v4/plumbing/transport/ssh

../../../go/pkg/mod/gopkg.in/src-d/[email protected]/plumbing/transport/ssh/common.go:147:15: undefined: proxy.Dial

Scan individual source code files

GLD supports LICENSE and README at the moment. Would be nice to scan source code files too.

  1. Find source code files using enry
  2. Parse comments in the headers
  3. Apply heuristics to split the text into license and description
  4. Query the license part using the existing functions
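
Step 2 of this proposal might look like the sketch below. It handles only leading `//` line comments; block comments and other languages' comment syntaxes are left out. The headerComment name is hypothetical, and finding files with enry (step 1) and the later steps are not shown.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// headerComment returns the text of the leading run of `//` comment lines in
// src, with the comment markers stripped. Leading blank lines are skipped;
// the first non-comment line (including a blank one) ends the header.
func headerComment(src string) string {
	var lines []string
	sc := bufio.NewScanner(strings.NewReader(src))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		switch {
		case strings.HasPrefix(line, "//"):
			lines = append(lines, strings.TrimSpace(strings.TrimPrefix(line, "//")))
		case line == "" && len(lines) == 0:
			continue
		default:
			return strings.Join(lines, "\n")
		}
	}
	return strings.Join(lines, "\n")
}

func main() {
	// A made-up source file.
	src := "// Copyright 2018 Example Authors\n// Licensed under the MIT License.\n\npackage main\n"
	fmt.Println(headerComment(src))
}
```

The returned text is what a heuristic would then split into license and description parts before querying the existing detection functions.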

Build failure -- How to build?

Trying to build this from source fails:

~/Contributions/go-license-detector
 (master) 0
$ GOPROXY="" go build .
can't load package: package gopkg.in/src-d/go-license-detector.v2: unknown import path "gopkg.in/src-d/go-license-detector.v2": cannot find module providing package gopkg.in/src-d/go-license-detector.v2

How do you build this? I can see you are using Go modules but for some reason the import path defined in your go.mod go build can't find :/ wut?!

gets stuck running the checker

hey

At my project's root, I run:

export GO111MODULE=on
go mod download
go build -v gopkg.in/src-d/go-license-detector.v3/cmd/license-detector
./license-detector . --format json

but it stays there forever without output. It looks blocked. How long is it supposed to take? It's a small project. Thanks.

Add comparison to few other tools for license detection

On top of the awesome list that already exists in https://github.com/src-d/go-license-detector#quality it would be nice to include numbers for:

Also, add a .sh script to reproduce those measurements, sort the resulting table (e.g. by detection rate), and maybe add a programming language column.

license for text

any reason not to resolve licenses from text in a public package?

Return matching license files

Currently, both Detect and Analyse return only a mapping from SPDX license ID to a confidence in the form of a float32.

It would be awesome, if there was some way to see which file caused the highest probability of a certain license id. I'd use this feature for also referring a user to a path where the license can be found. Also, it'd simplify further postprocessing on the license such as e.g. extracting copyright information etc.

Btw: Love this library, keep up the good work!
If help is wanted on this issue, I can also have a look into it.

go.sum is missing from the repository

Hi,

The go.sum file is needed when using go modules because it makes sure users build your program with the same modules you used. See this link for the rationale.

https://github.com/golang/go/wiki/Modules#how-to-prepare-for-a-release

Here are the steps to fix this. I can submit a pr if you want, but I thought it better to describe the steps in this situation.

  1. revert 4407ba1
  2. run go mod tidy
  3. commit go.sum to the repository.

Can you please make a new release with this fix as soon as possible?

Thanks for your time,

William
