nexb / debian-inspector

A python library to parse Debian deb822-style control and copyright files and all related Debian, Ubuntu and Debian-derivative manifest and metadata files, an alternative approach to python-debian.

Python 94.77% Batchfile 2.12% Shell 2.63% Makefile 0.49%
deb822 debian dep5 debian-control python-debian debian-packages debian-packaging dpkg ubuntu debian-copyright

debian-inspector's People

Contributors

agustinhenze, arijitde92, arnav-mandal1234, ayansinhamahapatra, chinyeungli, commod0re, dkg, dotlambda, dsoprea, j08ny, johnmhoran, jonoyang, keshav-space, mjherzog, omkarph, palbee, pombredanne, ppinard, stephanlachnit, steven-esser, swastkk, tg1999, xolox

debian-inspector's Issues

2 blank lines instead of 1 between paragraphs fails paragraph parsing

When there are multiple blank lines between paragraphs in a Debian copyright file, the paragraphs are not detected as CopyrightFilesParagraph/CopyrightLicenseParagraph but as CatchAllParagraph, i.e. as 'other' paragraphs.

All cases where 'other' paragraphs were detected in the tests with structured copyright files in tests/packagedcode/test_debian_copyright were of this category, i.e. this is a very frequently encountered problem.

Some examples (not an exhaustive list):

  • debian-slim-2021-04-07/usr/share/doc/gpgv/copyright#L113
  • debian-slim-2021-04-07/usr/share/doc/util-linux/copyright#341
  • debian-slim-2021-04-07/usr/share/doc/perl-base/copyright#548
  • debian-slim-2021-04-07/usr/share/doc/libuuid1/copyright#342
  • debian-slim-2021-04-07/usr/share/doc/libnsl2/copyright#172
  • debian-slim-2021-04-07/usr/share/doc/mount/copyright#342
  • debian-slim-2021-04-07/usr/share/doc/gpgv/copyright#113
  • debian-slim-2021-04-07/usr/share/doc/libsmartcols1/copyright#341
  • debian-slim-2021-04-07/usr/share/doc/libmount1/copyright#342
  • debian-slim-2021-04-07/usr/share/doc/libp11-kit0/copyright#30
  • debian-slim-2021-04-07/usr/share/doc/liblz4-1/copyright#59
  • debian-slim-2021-04-07/usr/share/doc/libblkid1/copyright#342
  • debian-slim-2021-04-07/usr/share/doc/bsdutils/copyright#342
  • debian-2019-11-15/main/c/clamav/stable_copyright#16

The paragraph separation is handled in this function:

def split_in_paragraphs(text):
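
As a minimal sketch of one possible fix (not the library's actual implementation), paragraphs could be split on runs of one or more blank lines instead of exactly one. The name `split_paragraphs_tolerant` is hypothetical, and treating whitespace-only lines as separators is an assumption.

```python
import re


def split_paragraphs_tolerant(text):
    """
    Hypothetical helper: split deb822-style text into paragraphs,
    accepting one OR MORE blank (or whitespace-only) lines as separator.
    """
    # A newline followed by one or more lines that hold only spaces/tabs.
    # Continuation lines (" text", " .") contain non-blank characters and
    # therefore never match.
    parts = re.split(r'\n(?:[ \t]*\n)+', text.strip('\n'))
    return [part for part in parts if part.strip()]
```

With this approach, two or three blank lines between a Files paragraph and a License paragraph yield the same two paragraphs as a single blank line would.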

Determine the primary license from a copyright file

This should be based on the paragraph that contains Files: *.
This would be really handy, as Debian copyright files err on the comprehensive and verbose side. Knowing the primary license would help deal with the data density better!
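
A rough sketch of the idea, assuming paragraphs are available as plain dicts with lowercase `files` and `license` keys (similar to the `to_dict()` output shown elsewhere in these issues). The function name `primary_license` is hypothetical and not part of debian-inspector.

```python
def primary_license(paragraphs):
    """
    Hypothetical helper: return the license short name of the first
    paragraph whose Files field contains the catch-all pattern "*",
    or None if there is no such paragraph.
    """
    for para in paragraphs:
        patterns = (para.get('files') or '').split()
        if '*' in patterns:
            # The first line of a License field is the short name;
            # any following lines are the license text.
            lines = (para.get('license') or '').strip().splitlines()
            return lines[0].strip() if lines else None
    return None
```

Only the exact pattern `*` is treated as the catch-all here; patterns such as `debian/*` are deliberately ignored.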

debut-0.9.4 not detecting GPLv2 from copyright texts

Tern uses the debut package to parse Debian copyright files and find package licenses. I understand that debut is now debian-inspector, but as far as I can tell the code is the same at the moment, so I am opening an issue in this repo. Debut is not finding a license for the following copyright text (libgpm2copyright.txt) from the libgpm2 package. Here is what we do to collect the licenses; it does not yield any results:

>>> from debut import debcon
>>> from debut import copyright as debut_copyright

>>> with open('libgpm2copyright.txt') as file:
...     libgpm2copy = file.read()

>>> collected_paragraphs = list()
>>> for paragraph in iter(debcon.get_paragraphs_data(libgpm2copy)):
...     if 'license' in paragraph:
...             cp = debut_copyright.CopyrightLicenseParagraph.from_dict(paragraph)
...             collected_paragraphs.append(cp)
>>> collected_paragraphs
[CopyrightLicenseParagraph(license=LicenseField(name='', text=None), comment=FormattedTextField(text=None), extra_data={})]


>>> deb_pkg_data = debut_copyright.DebianCopyright(collected_paragraphs).to_dict()
>>> deb_pkg_data
{'paragraphs': [{'license': '', 'comment': ''}]}

Can debian-inspector parse the licenses in this text?

encoding issue when using `copyright.DebianCopyright.from_file`

I created a parser using the debut library with the following code:

data = copyright.DebianCopyright.from_file(input).to_dict()

However, this line of code fails with an encoding error for some files:

'ascii' codec can't decode byte 0xe2 in position 17: ordinal not in range(128)

Some sample files are:
https://changelogs.ubuntu.com/changelogs/pool/main/d/dpkg/dpkg_1.19.0.5ubuntu2/copyright
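
The error message suggests the file was decoded with an ASCII default encoding. As a workaround sketch only, the text can be read with an explicit UTF-8 encoding that does not depend on the locale. `read_copyright_text` is a hypothetical helper, not a debut or debian-inspector function, and how the resulting text is then handed to the parser is left open here.

```python
def read_copyright_text(location):
    """
    Hypothetical helper: read a copyright file as UTF-8 regardless of the
    locale default encoding. Undecodable bytes are replaced instead of
    raising UnicodeDecodeError.
    """
    with open(location, encoding='utf-8', errors='replace') as copyright_file:
        return copyright_file.read()
```

Byte 0xe2 is the lead byte of many multi-byte UTF-8 sequences, such as typographic quotes, which is consistent with a UTF-8 file being read as ASCII.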

Some tests fail on Python 3.9 with `TypeError: __init__() got an unexpected keyword argument 'encoding'`

___________ TestDebian822.test_Debian822_from_file__signed_from_dsc ____________

self = <test_debcon.TestDebian822 testMethod=test_Debian822_from_file__signed_from_dsc>

    def test_Debian822_from_file__signed_from_dsc(self):
        test_file = self.get_test_loc('debcon/deb822/zlib_1.2.11.dfsg-1.dsc')
        expected_loc = 'debcon/deb822/zlib_1.2.11.dfsg-1.dsc-expected-deb822.json'
        results = debcon.Debian822.from_file(test_file).to_dict()
>       self.check_json(results, expected_loc, regen=False)

tests/test_debcon.py:139:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tests/test_utils.py:35: in check_json
    expected = json.load(ex, encoding='utf-8')
/nix/store/y1mhcjrx951lbdbf69v2wng33gc7lx1j-python3-3.9.2/lib/python3.9/json/__init__.py:293: in load
    return loads(fp.read(),

[snip]

E       TypeError: __init__() got an unexpected keyword argument 'encoding'

/nix/store/y1mhcjrx951lbdbf69v2wng33gc7lx1j-python3-3.9.2/lib/python3.9/json/__init__.py:359: TypeError

[snip]

FAILED tests/test_contents.py::TestContentsParse::test_parse_contents_debian_no_header_gzipped
FAILED tests/test_contents.py::TestContentsParse::test_parse_contents_ubuntu_with_header_plain
FAILED tests/test_copyright.py::TestDebianCopyright::test_DebianCopyright_from_file__from_copyrights_dep5_1
FAILED tests/test_copyright.py::TestDebianCopyright::test_DebianCopyright_from_file__from_copyrights_dep5_3
FAILED tests/test_copyright.py::TestDebianCopyright::test_DebianCopyright_from_file__from_copyrights_dep5_dropbear
FAILED tests/test_debcon.py::TestGetParagraphData::test_get_paragraph_data_from_file__signed_from_dsc_can_remove_signature
FAILED tests/test_debcon.py::TestGetParagraphData::test_get_paragraph_data_from_file__signed_from_dsc_does_not_crash_if_signature_not_removed
FAILED tests/test_debcon.py::TestGetParagraphData::test_get_paragraph_data_from_file_from_status
FAILED tests/test_debcon.py::TestGetParagraphData::test_get_paragraph_data_from_file_from_status_can_handle_perl_status
FAILED tests/test_debcon.py::TestGetParagraphData::test_get_paragraphs_data_from_file__from_copyrights_dep5_1
FAILED tests/test_debcon.py::TestGetParagraphData::test_get_paragraphs_data_from_file__from_copyrights_dep5_3
FAILED tests/test_debcon.py::TestGetParagraphData::test_get_paragraphs_data_from_file__from_copyrights_dep5_dropbear
FAILED tests/test_debcon.py::TestGetParagraphData::test_get_paragraphs_data_from_file__from_packages
FAILED tests/test_debcon.py::TestGetParagraphData::test_get_paragraphs_data_from_file__from_sources
FAILED tests/test_debcon.py::TestGetParagraphData::test_get_paragraphs_data_from_file__from_status
FAILED tests/test_debcon.py::TestDebian822::test_Debian822_from_file__from_status
FAILED tests/test_debcon.py::TestDebian822::test_Debian822_from_file__signed_from_dsc

All is fine on Python 3.8.
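
The `encoding` keyword of `json.loads` was removed in Python 3.9, and `json.load` forwards its keyword arguments, which matches the traceback above. A minimal sketch of a version-independent loader follows; the function name `load_expected_json` is hypothetical and the actual test helper may be structured differently.

```python
import json


def load_expected_json(expected_loc):
    """
    Hypothetical replacement for json.load(ex, encoding='utf-8'):
    pass the encoding to open() rather than to json.load().
    """
    with open(expected_loc, encoding='utf-8') as ex:
        return json.load(ex)
```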

Error when using `get_license_detection_from_nameless_paragraph()`

I was running the scancode.io Docker pipeline on a Docker image and I got the following error:

'license'

Traceback:
  File "/app/scanpipe/pipelines/__init__.py", line 115, in execute
    step(self)
  File "/app/scanpipe/pipelines/docker.py", line 93, in collect_and_create_system_packages
    docker.scan_image_for_system_packages(self.project, image)
  File "/app/scanpipe/pipes/docker.py", line 166, in scan_image_for_system_packages
    for i, (purl, package, layer) in enumerate(installed_packages):
  File "/usr/local/lib/python3.9/site-packages/container_inspector/image.py", line 446, in get_installed_packages
    for purl, package in layer.get_installed_packages(packages_getter):
  File "/app/scanpipe/pipes/debian.py", line 34, in package_getter
    for package in packages:
  File "/usr/local/lib/python3.9/site-packages/packagedcode/debian.py", line 178, in get_installed_packages
    dc = debian_copyright.parse_copyright_file(copyright_location)
  File "/usr/local/lib/python3.9/site-packages/packagedcode/debian_copyright.py", line 95, in parse_copyright_file
    dc = StructuredCopyrightProcessor.from_file(
  File "/usr/local/lib/python3.9/site-packages/packagedcode/debian_copyright.py", line 350, in from_file
    dc.detect_license()
  File "/usr/local/lib/python3.9/site-packages/packagedcode/debian_copyright.py", line 590, in detect_license
    files_license_detections = self.get_license_detections(
  File "/usr/local/lib/python3.9/site-packages/packagedcode/debian_copyright.py", line 618, in get_license_detections
    get_license_detection_from_nameless_paragraph(paragraph=paragraph)
  File "/usr/local/lib/python3.9/site-packages/packagedcode/debian_copyright.py", line 1571, in get_license_detection_from_nameless_paragraph
    start_line, _ = paragraph.get_field_line_numbers('license')
  File "/usr/local/lib/python3.9/site-packages/debian_inspector/copyright.py", line 182, in get_field_line_numbers
    return self.line_numbers_by_field[field_name]

I don't have the full stack trace, but it appears that the issue is that we try to get the value for the key license at https://github.com/nexB/debian-inspector/blob/main/src/debian_inspector/copyright.py#L182, but that key does not exist in self.line_numbers_by_field.
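
One defensive option, sketched here as a standalone function rather than as the library's method, is to return a placeholder when the field was never recorded. The `(start, end)` tuple shape is an assumption inferred from `start_line, _ = ...` in the traceback, and the `(None, None)` default is a hypothetical choice.

```python
def get_field_line_numbers(line_numbers_by_field, field_name):
    """
    Hypothetical tolerant lookup: return the (start, end) line numbers
    recorded for a field, or (None, None) when the field is absent,
    instead of raising KeyError.
    """
    return line_numbers_by_field.get(field_name, (None, None))
```

Callers would then need to handle a `None` start line explicitly, which at least makes the missing-field case visible at the call site.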

"De-deb" some texts, such as Package descriptions

Debian uses space-prefixed continuation lines and "space-dot" lines to represent empty lines. We should have a function to strip these and, optionally, another to "re-deb" the text.
See for instance:

 GNU Bourne Again SHell
 Bash is an sh-compatible command language interpreter that executes
 commands read from the standard input or from a file.  Bash also
 incorporates useful features from the Korn and C shells (ksh and csh).
 .
 Bash is ultimately intended to be a conformant implementation of the
 IEEE POSIX Shell and Tools specification (IEEE Working Group 1003.2).
 .
 The Programmable Completion Code, by Ian Macdonald, is now found in
 the bash-completion package.
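
A minimal sketch of what such a pair of functions could look like. The names `dedeb` and `redeb` are hypothetical, and the sketch assumes every line carries exactly one leading space of Debian formatting.

```python
def dedeb(text):
    """
    Hypothetical helper: strip the one-space continuation prefix and turn
    "space-dot" lines back into empty lines.
    """
    lines = []
    for line in text.splitlines():
        if line.rstrip() == ' .':
            lines.append('')
        elif line.startswith(' '):
            lines.append(line[1:])
        else:
            lines.append(line)
    return '\n'.join(lines)


def redeb(text):
    """
    Hypothetical inverse: prefix each line with a space and encode empty
    lines as " ." so the text is valid as a multi-line field value.
    """
    return '\n'.join(
        ' ' + line if line.strip() else ' .'
        for line in text.splitlines()
    )
```

Only the first space is removed, so additional indentation inside the description is preserved.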

Separating different kind of paragraphs should be moved into debian_inspector

In scancode-toolkit, at src/packagedcode/debian_copyright.py, there is the DebianCopyrightParagraphs class, which basically extends DebianCopyright from debian_inspector/copyright.py to also separate the different types of paragraphs into functional groups that are then used for parsing and copyright/license detection.

This should be moved to debian_inspector/copyright.py as it only extends functionality of the DebianCopyright class.

https://github.com/nexB/scancode-toolkit/blob/2390-improve-debian-license-detection/src/packagedcode/debian_copyright.py#L739
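
To illustrate the kind of grouping meant here, the following sketch classifies paragraphs given as plain dicts. It is not the DebianCopyrightParagraphs code: the function name `group_paragraphs`, the group names, and the key-based rules are all assumptions made for illustration.

```python
from collections import defaultdict


def group_paragraphs(paragraphs):
    """
    Hypothetical helper: group copyright paragraphs (plain dicts with
    lowercase field names) into functional groups.
    """
    groups = defaultdict(list)
    for para in paragraphs:
        if 'format' in para:
            kind = 'header'    # the leading header paragraph
        elif 'files' in para:
            kind = 'files'     # Files paragraphs
        elif 'license' in para:
            kind = 'license'   # stand-alone License paragraphs
        else:
            kind = 'other'     # anything unrecognized
        groups[kind].append(para)
    return dict(groups)
```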

Some installed files not reported

For a package in a status file such as:

Package: libc6
Status: install ok installed
Priority: required
Section: libs
Installed-Size: 10954
Maintainer: Ubuntu Developers <[email protected]>
Architecture: amd64
Multi-Arch: same
Source: glibc
Version: 2.23-0ubuntu11.3
Replaces: libc6-amd64
Depends: libgcc1
Suggests: glibc-doc, debconf | debconf-2.0, locales
Breaks: hurd (<< 1:0.5.git20140203-1), libtirpc1 (<< 0.2.3), locales (<< 2.23), locales-all (<< 2.23), lsb-core (<= 3.2-27), nscd (<< 2.23)
Conffiles:
 /etc/ld.so.conf.d/x86_64-linux-gnu.conf 593ad12389ab2b6f952e7ede67b8fbbf
Description: GNU C Library: Shared libraries
 Contains the standard libraries that are used by nearly all programs on
 the system. This package includes shared versions of the standard C library
 and the standard math library, as well as many others.
Homepage: http://www.gnu.org/software/libc/libc.html
Original-Maintainer: GNU Libc Maintainers <[email protected]>

The list of installed files at

  • /var/lib/dpkg/info/libc6:amd64.list

may be missed.

Note also that the same status file may contain:

Package: libc6
Status: install ok installed
Priority: required
Section: libs
Installed-Size: 9587
Maintainer: Ubuntu Developers <[email protected]>
Architecture: i386
Multi-Arch: same
Source: glibc
Version: 2.23-0ubuntu11.3
Replaces: libc6-i386, libc6-xen
Provides: libc6-i686, libc6-xen
Depends: libgcc1
Suggests: glibc-doc, debconf | debconf-2.0, locales
Breaks: hurd (<< 1:0.5.git20140203-1), libtirpc1 (<< 0.2.3), locales (<< 2.23), locales-all (<< 2.23), nscd (<< 2.23)
Conflicts: libc6-xen
Conffiles:
 /etc/ld.so.conf.d/i386-linux-gnu.conf 1c63da36f33ec6647af1d8faff9b9795
Description: GNU C Library: Shared libraries
 Contains the standard libraries that are used by nearly all programs on
 the system. This package includes shared versions of the standard C library
 and the standard math library, as well as many others.
Homepage: http://www.gnu.org/software/libc/libc.html
Original-Maintainer: GNU Libc Maintainers <[email protected]>

We also have all these other files in /var/lib/dpkg/info/libc6

libc6:amd64.conffiles    libc6-dbg:amd64.md5sums  libc6:i386.postrm
libc6:amd64.list         libc6-dev:amd64.list     libc6:i386.preinst
libc6:amd64.md5sums      libc6-dev:amd64.md5sums  libc6-i386.shlibs
libc6:amd64.postinst     libc6-i386.conffiles     libc6:i386.shlibs
libc6:amd64.postrm       libc6:i386.conffiles     libc6-i386.symbols
libc6:amd64.preinst      libc6-i386.list          libc6:i386.symbols
libc6:amd64.shlibs       libc6:i386.list          libc6:i386.templates
libc6:amd64.symbols      libc6-i386.md5sums       libc6-i386.triggers
libc6:amd64.templates    libc6:i386.md5sums       libc6:i386.triggers
libc6:amd64.triggers     libc6:i386.postinst      
libc6-dbg:amd64.list     libc6-i386.postrm        
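
Judging from the listing above, the installed-files list may be named either `<package>:<arch>.list` or `<package>.list`. A small sketch that builds both candidates from the Package and Architecture fields follows; `candidate_list_paths` is a hypothetical name and not a debian-inspector function.

```python
import posixpath


def candidate_list_paths(name, arch, info_dir='/var/lib/dpkg/info'):
    """
    Hypothetical helper: return the possible locations of the installed
    files list for a package, arch-qualified name first.
    """
    return [
        posixpath.join(info_dir, f'{name}:{arch}.list'),
        posixpath.join(info_dir, f'{name}.list'),
    ]
```

A caller would check each candidate in order and use the first one that exists, so that both the amd64 and i386 entries for libc6 map to their own list file.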

Add documentation

Reported by @elear
This project could benefit from some documentation, at a minimum a code-derived, Sphinx-generated API doc.
