
License compliance


Introduction

This wiki page describes the logic of the License Compliance Verifier (LCV) and the Compatibility Matrix (CM), currently based on their demonstrator.

The current state of this demonstrator represents a snapshot of rules that focus on license compatibility. This page also provides a brief overview of the current state of license documentation and compliance reporting.

ScanCode author Ombredanne, in his recent article "Free and open-source software license compliance: tools for software composition analysis" [1], presents a study [2] on license documentation revealing that fewer than 5% of approximately 5,000 popular free and open-source software (FOSS) packages contain complete and unambiguous license documentation. His paper offers an accurate analysis of techniques and tools to identify which FOSS components compose a package, their provenance, and their licenses. Furthermore, it draws a clear distinction between scanning and matching techniques, highlighting their respective advantages and drawbacks. This distinction makes them complementary in addressing the problems associated with Software License Compliance (SLC).

Scanning techniques are adopted to retrieve: information from package manifests and build scripts; explicit license notices, tags, mentions, and texts; and other provenance clues, such as emails, URLs, and specific code constructs, i.e., programming language imports, include statements, namespaces, and code tree structures. When provenance and license information is not provided, scanning becomes useless, leaving room for matching techniques, which promise to be an attractive area for new development.

From a developer's perspective, the accurate selection of open source components implies considering licensing issues, narrowing down the search by examining whether the project's license is compatible with a specific business model, mission, or other software used [3].

This demonstrator assumes that the project's knowledge base already contains license information related to a specific package version and shows how the License Compliance Verifier (LCV) performs compliance assessments.

The knowledge base should include license information related to each package. This information is called the outbound license. When a package includes other packages as dependencies, the outbound licenses of the dependencies will become the array of inbound licenses for the package. The LCV checks that each array element does not have compatibility issues with the outbound license declared for the final package.

To perform the license assessment, the LCV uses a Compatibility Matrix that stores compatibility rules. These assessments will populate the package metadata table of the Knowledge Base.

Technical details and implementation

LCV is written in Python and can be executed in two modes: as a Command Line Interface (CLI) tool or as a `flask` web server providing REST APIs.

The LCV source code is available on GitHub.

The whole process is composed of three main phases:

  • Retrieve or provide an outbound license,

  • Retrieve or provide one or more inbound licenses,

  • Perform the compatibility assessment, using the rules set on the Compatibility Matrix (CM).

Compatibility matrix

To develop our Compatibility Matrix (CM), we chose to take a subset of the compatibility rules expressed within the Open Source Automation Development Lab (OSADL) matrix [4]. The CM holds inbound licenses in the rows and outbound licenses in the columns. We expressed the associations between inbound and outbound licenses using `1`s and `0`s: `1` if the inbound license is compatible with the declared outbound one and `0` otherwise. As employed by OSADL [4], we added the notation `II` for Insufficient Information cases and `DEP` for cases that depend on the use case. Our LCV algorithm provides a single output when interpreting these two particular notations: the `DUC` (Depending on the Use Case) output.

We chose 1s and 0s to express compatibility rules because of the usage of Postgres within the Fasten Knowledge Base. Initially, we did not plan to have a `CSV` storing the CM, but it made creating this demonstrator easier. We also wrote the SQL query to create the CM as a table and populate it with the `CSV` that LCV currently uses. Postgres has a so-called bit data field, which stores boolean values as `1` for `True` and `0` for `False`.

LCV, reading from the CSV, interprets 1 as True and 0 as False, which keeps the Compatibility Matrix suitable for deployment as a Postgres table; however, how to handle the `II` and `DEP` values there still needs further consideration.
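
As an illustration, a minimal sketch of such a lookup, assuming a CSV whose first column holds the inbound license and whose header row holds the outbound licenses (the actual layout of the LCV CSV may differ):

```python
import csv

def load_matrix(path):
    """Load the Compatibility Matrix CSV into a nested dict:
    matrix[inbound][outbound] -> '1', '0', 'II' or 'DEP'."""
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    outbounds = rows[0][1:]          # header row holds the outbound licenses
    return {row[0]: dict(zip(outbounds, row[1:])) for row in rows[1:]}

def check(matrix, inbound, outbound):
    """Translate a matrix cell into the LCV outputs True / False / 'DUC'."""
    cell = matrix[inbound][outbound]
    if cell in ("II", "DEP"):        # Insufficient Information / DEPending cases
        return "DUC"                 # Depending on the Use Case
    return cell == "1"
```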

Endpoints

Most of the endpoints provided by the LCV flask server are usable through GET and POST requests or via HTML forms. GET and POST requests can be performed, e.g., with Postman or cURL, or can be issued from within other source code.

We are performing these requests within the License Compliance Java plugin for integration with Fasten.

From now on, the prefix http://0.0.0.0:3251 to add before each endpoint is omitted. All the functions mentioned are located in /src/LCVlib/verify.py.

Retrieving the outbound license using GitHub APIs

There are two endpoints to retrieve outbound licenses from GitHub: one usable through an HTML page, and the other accepting a POST request.

Both endpoints make use of the retrieveOutboundLicense() function provided by the LCV libraries, which essentially queries the GitHub APIs, parses the retrieved JSON, and extracts the SPDX-id field.
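
As an illustration, a minimal sketch of this kind of lookup against the GitHub licenses API, assuming the `requests` library (the actual retrieveOutboundLicense() implementation may differ):

```python
import requests

def retrieve_outbound_license(owner, repo):
    """Return the SPDX id of a repository license via the GitHub licenses API."""
    url = f"https://api.github.com/repos/{owner}/{repo}/license"
    response = requests.get(url, headers={"Accept": "application/vnd.github+json"})
    response.raise_for_status()
    return response.json()["license"]["spdx_id"]  # e.g. "Apache-2.0" or "NOASSERTION"
```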

Performing license compliance assessments - simple boolean output

Four endpoints provide a boolean output while performing license compliance assessments: /CompatibilityFlag and /CompatibilitySPDXFlag, which are usable through HTML forms, and /LicensesInputFlag and /LicensesInputSPDXFlag, which accept POST requests. The SPDX endpoints receive an array of inbound licenses in SPDX-id format, while the others use the Scancode name format. All the endpoints receive a single outbound license that should be in SPDX-id format.

  • /CompatibilityFlag links to an HTML form composed of two fields: one to insert an array of comma-separated inbound licenses (a single element is also accepted) as Scancode names, and another to insert the single outbound license in SPDX-id format.

  • /CompatibilitySPDXFlag accepts both the array of comma-separated inbound licenses and the single outbound license in SPDX-id format.

  • /LicensesInputFlag?InboundLicenses=FirstInboundLicense,SecondInboundLicense, ... nInboundLicense&OutboundLicense=UniqueOutboundLicenseSPDXid is a POST method that takes as input the array of comma-separated inbound licenses (Scancode names) and the SPDX-id of a single outbound license.

  • /LicensesInputSPDXFlag?InboundLicenses=FirstInboundLicense,SecondInboundLicense, ... nInboundLicense&OutboundLicense=UniqueOutboundLicenseSPDXid is a POST method that takes as input the array of comma-separated inbound licenses (SPDX-id format) and the SPDX-id of a single outbound license. The image below shows a Postman call to this endpoint, passing MIT and Apache-2.0 as inbound licenses in SPDX-id format and Apache-2.0 as the outbound license. The LCV provides True as the assessment result.

![/LicensesInputSPDXFlag endpoint: SPDX IDs for MIT and Apache-2.0 as inbound licenses and Apache-2.0 as the outbound, True as assessment flag.](img/LicensesInputSPDXFlag.png)
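
The call shown in the figure can also be reproduced programmatically; a minimal sketch using Python and the `requests` library, with the endpoint and parameter names described above:

```python
import requests

# Two SPDX inbound licenses and one SPDX outbound license, as in the figure above.
response = requests.post(
    "http://0.0.0.0:3251/LicensesInputSPDXFlag",
    params={"InboundLicenses": "MIT,Apache-2.0", "OutboundLicense": "Apache-2.0"},
)
print(response.text)  # expected to report True for this combination
```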

The SPDX endpoints make use of the CompareSPDXFlag() function, while the Scancode endpoints make use of the CompareFlag() one. Both functions rely on the verifyFlag() function, which performs a compatibility check for each element of the inbound licenses array against the outbound one. The verifyFlag() function retrieves compatibility information from the CSV representing the Compatibility Matrix (CM), finally providing a verification flag.

This verification flag will populate the compatibility assessment column of the Fasten Knowledge Base metadata table.

It results from processing an array of booleans containing the compatibility rules retrieved from the CM for each association between inbound and outbound licenses.

The LCV interprets the boolean array as follows (a minimal sketch follows the list below):

  • if all the elements are True, the LCV will provide True as a final value, and the license assessment will be flagged as passed,
  • if only one of the elements is False, the LCV will provide False as a final value, and the license assessment will be flagged as not passed.
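
A minimal sketch of this combination logic, reusing the check() helper from the Compatibility Matrix sketch above (the actual verifyFlag() implementation may differ):

```python
def assess(matrix, inbound_licenses, outbound):
    """Combine the per-license results into a single verification flag."""
    results = [check(matrix, inbound, outbound) for inbound in inbound_licenses]
    if all(r is True for r in results):
        return True    # license assessment passed
    if any(r is False for r in results):
        return False   # at least one inbound license is incompatible
    return "DUC"       # assumption: remaining cases depend on the use case
```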

These endpoints perform the license compatibility assessment, but since they provide only boolean results, they cannot offer more detailed logs. Other endpoints produce license compatibility logs.

Providing logs related to the license compliance assessments

This section shows four endpoints that provide an assessment log. Their input is the same as the Flag endpoints described in section 2.2.2, namely an array of inbound licenses and a single outbound license. The output, instead of being a simple boolean value, is a more verbose log format.

The endpoints are the following:

/Compatibility and /CompatibilitySPDX, which are usable through HTML forms, and /LicensesInput and /LicensesInputSPDX, which accept POST requests. The SPDX endpoints receive an array of inbound licenses in SPDX-id format, while the others use the Scancode name format. All the endpoints receive a single outbound license that should be in SPDX-id format.

  • /Compatibility and /CompatibilitySPDX give as output an assessment log showing compatibility information for the array of inbound licenses and the declared outbound one inserted via an HTML form.

  • /LicensesInput?InboundLicenses=FirstInboundLicense,SecondInboundLicense, ... nInboundLicense&OutboundLicense=UniqueOutboundLicenseSPDXid and /LicensesInputSPDX?InboundLicenses=FirstInboundLicense,SecondInboundLicense, ... nInboundLicense&OutboundLicense=UniqueOutboundLicenseSPDXid provide as output an assessment log with compatibility information for the inbound license array matched against the outbound one, received via HTTP POST.

As for the endpoints described in section 2.2.2, endpoints whose names contain the SPDX keyword accept a list of inbound licenses in SPDX format; where SPDX is not present, the inbound licenses are given as Scancode names. All these endpoints accept a single outbound license in SPDX format.

![/LicensesInput endpoint: Scancode names for Apache 2.0 and MIT License as inbound licenses and Apache-2.0 (SPDX-id) as the outbound, verbose output as a response.](img/LicensesInput.png)

The image above shows a Postman call to the /LicensesInput endpoint, passing Apache 2.0 and MIT License as inbound licenses in Scancode names and Apache-2.0 as the outbound license. The LCV provides a verbose output showing each inbound license compared with the outbound one.

Automated tests

LCV performs automated REST API tests implemented with `Newman`, an `npm` module that runs `Postman collections`. It runs two different CI/CD pipelines, one with `GitHub actions` and another with `Jenkins`, requiring separate Dockerfiles.

GitHub Actions uses Dockerfile, while Jenkins and the local deployment of the LCV flask server use DockerfileExternal. The GitHub Actions workflow requires a different Dockerfile because different jobs run on different machines, preventing these jobs from reaching each other via pre-defined IP addresses. It would have been possible to implement a GitHub composite action, but the first action would have to terminate before the second could run. Unfortunately, in the LCV case, this approach would prevent testing the APIs on a running instance of the LCV flask server, because the action performing the tests would start only after the server terminates.

For this reason, Dockerfile integrates both Python, to run LCV, and npm, to run Newman for the REST API tests.

It is worth mentioning that the LCV flask server uses DockerfileExternal, which does not install npm dependencies, keeping the Docker image lighter. npm is not required for LCV to function but only for testing purposes.

Currently, we perform tests using a Postman collection composed of 20 tests, half of them ensuring that all the tested endpoints are reachable and the other half checking that the output provided by each endpoint corresponds to the expected one. We check the output using the pm object through test scripts written in JavaScript that run after receiving the response.

Integration with Fasten

For the integration within Fasten, LCV will perform license compliance assessments via REST APIs. A `flask` docker container will provide the APIs.

At the same time, the Java license compliance plugin will call them, passing an array of inbound licenses and a single outbound license to perform an assessment.

Usage

The LCV demonstrator can be run by cloning the GitHub repository:
git clone git@github.com:endocode/LCV-CM.git

Building Docker image:

docker build --no-cache -t lcv-cm -f DockerfileExternal .

The repository contains two Dockerfiles because GitHub actions use Dockerfile to run the CI/CD pipeline, while another CI/CD pipeline implemented in Jenkins and the local deployment of the LCV flask server use DockerfileExternal.

And running the Docker container:

docker run -it -p 3251:3251 lcv-cm

The LCVServer will run on port 3251 of your localhost and be reachable via http://0.0.0.0:3251/APIEndpoints.

Flask uses the address 0.0.0.0 to allow HTTP connections coming from outside the Docker container.
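
As an illustrative sketch only (not the actual LCVServer.py), this is how a Flask app can be bound to 0.0.0.0 on port 3251 so that the port mapping above works:

```python
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/APIEndpoints")
def api_endpoints():
    # Illustrative only: list a few of the endpoints described above.
    return jsonify(["/LicensesInputFlag", "/LicensesInputSPDXFlag",
                    "/LicensesInput", "/LicensesInputSPDX"])

if __name__ == "__main__":
    # Binding to 0.0.0.0 accepts HTTP connections coming from outside the container.
    app.run(host="0.0.0.0", port=3251)
```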

The /APIEndpoints endpoint is a collection of most of the endpoints provided by LCV.

Evaluation

We evaluated the `LCV demonstrator` on 32 GitHub repositories, selected according to the following criteria:
  • A pom.xml file must be present in their root directory.
  • Smaller than 30 MB.
  • At least 51% Java.

We initially selected a total of 94 repositories and tried to generate a JSON report for each of them using the open-source tool QMSTR. Only 32 of them could be scanned without producing errors, revealing how difficult it is for scanning tools to detect license information, partly because the repositories did not provide clear license information.

We showcased how LCV performs assessments using two functions explicitly written to scan all the JSON reports and extract inbound licenses from each report, namely JSONPathList() and RetrieveInboundLicenses().

Notwithstanding that LCV offers REST APIs, we conducted the evaluation using the CLI. We mainly used two Python scripts, main.py and tests.py. Both rely on the same set of libraries in the src/LCV/LCVlib/ directory.

To collect inbound licenses in SPDX format, we wrote the SPDXIdMapping() function, which uses an external CSV to map the license names used within QMSTR to SPDX-ids.
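
A minimal sketch of such a mapping, assuming a hypothetical two-column CSV (QMSTR name, SPDX id); the actual CSV layout used by SPDXIdMapping() may differ:

```python
import csv

def spdx_id_mapping(qmstr_name, mapping_csv="qmstr_to_spdx.csv"):
    """Map a license name reported by QMSTR to an SPDX id (hypothetical CSV layout)."""
    with open(mapping_csv, newline="") as f:
        for row in csv.reader(f):
            if row and row[0] == qmstr_name:
                return row[1]
    return None  # unknown name: callers decide how to treat it
```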

We also wrote two libraries supporting this evaluation phase: LCVlib/testlistsJSONfiles.py and LCVlib/testlistsGitHubAPI.py. These libraries create two lists, one list of JSON reports and a list of outbound licenses retrieved from GitHub APIs. Both lists follow the same project order, enabling a single index to match the same project in both lists.

verify.py contains the main functions commonly used from the scripts mentioned above. The LCVServer.py also uses verify.py to provide REST API with Flask, but we did not use it for this evaluation.

To use main.py against a single repository, we run this command:

python3 main.py 1

where 1 is the index corresponding to the second JSON report listed in testlistsJSONfiles.py.

The tests.py Python script loops over each report and shows the results in the console, ranging over the entire lists from 0 to 31. To run it:

python3 tests.py

Results

LCV provides three output flags: True, False, and `DUC`.

True when a project is compliant, False when it is not, and DUC (depending on the use case). Trusting QMSTR JSON reports for the inbound licenses and GitHub APIs for the outbound ones, we found out that:

  • 24 projects flagged as True,
  • 1 project flagged as DUC, falling into the Depending on the Use Case (DUC) category because of Apache-2.0 as inbound and EPL-2.0 as outbound,
  • 7 projects flagged as False.

Among the 7 flagged as False:

  • 2 did not pass the assessment because one of the inbounds was not compatible with the outbound.
  • 1 had an unknown inbound license that could not be matched against the outbound one.
  • LCV marked 4 of them as not compliant because the GitHub APIs could not provide precise information on the outbound license, returning the generic NOASSERTION label, which prevented LCV from performing the assessment. Of these 4 projects, two had triple, one had double, and one had a single inbound license.

Considering the number of inbound licenses:

  • 24 repositories are single inbound license projects, and 22 of them have the same inbound and outbound license,
  • 5 are double inbound license projects, and only two of them have the outbound license included in the set of inbounds,
  • 3 are triple inbound license projects, supposedly all of them with the outbound included in the set of inbounds. Still, for two of them, the GitHub APIs reported NOASSERTION as the outbound license.

By conducting these experiments and developing the LCV demonstrator, we propose our approach for performing assessment-based license compliance verification when a set of inbound licenses and a single outbound license are provided.

The proposed approach relies on a subset of the license compatibilities adopted by the OSADL, representing a snapshot of the current state of the art in Open Source Software (OSS) license compliance.

Technical considerations

Since Fasten uses a volatile approach for storing information inserted into the knowledge base, the proposed procedure cannot be very accurate.

The license detection approach that we will implement lays the basis for retrieving dependencies' license information, the so-called inbound licenses, from the local pom.xml file and comparing it with the pom.xml hosted on Maven Central. Whenever the same package version is also present as a GitHub repository, we double-check the outbound license by querying the GitHub APIs.

When analyzing a specific package, apart from looking at the local pom.xml and the one hosted on Maven Central and querying the GitHub APIs, we run tools that implement scanning techniques to search for license tags inside the whole package.

Within our license detection, we mainly implement scanning techniques.

Considering a best-case scenario, the whole scanned project and the licenses declared for the dependencies within the pom.xml file do not present compatibility issues with the declared outbound license. Nevertheless, whenever another assessment discovers that a specific version of a dependency has license compatibility issues, the Fasten knowledge base should not be volatile, so that this valuable information can be preserved and used in the future to change previous assessments.

It is possible that performing license detection on a project that includes a dependency declared under a particular outbound license will not reveal license compatibility issues. However, it is also possible that performing license detection upon that particular dependency, considered as the final package, reveals that it has compatibility issues, mainly if the license detection phase performs matching techniques.

In [1], Ombredanne provides a detailed explanation of the differences between scanning and matching. The latter essentially matches a portion of source code, using hash functions, against well-known databases (e.g., Software Heritage [5][6], ClearlyDefined [7]) to check whether someone else has already released that code under a particular license.

This technique can detect portions of code inserted in a final package without considering the license under which that code was originally released.
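
As an illustration of the hashing idea behind matching (a sketch assuming git-style blob hashing, which is also the scheme behind Software Heritage content identifiers; not the method of any specific matching tool):

```python
import hashlib

def git_blob_sha1(content: bytes) -> str:
    """Git-style blob hash, as used for Software Heritage content identifiers."""
    header = f"blob {len(content)}\0".encode()
    return hashlib.sha1(header + content).hexdigest()

# The resulting digest can be looked up in databases of already-released code.
with open("SomeSourceFile.java", "rb") as f:   # hypothetical file name
    print(git_blob_sha1(f.read()))
```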

To make use of license information discovered later, it should be stored in a non-volatile database, enabling the License Compliance plugin to retrieve it. While performing license compliance checks, apart from the information retrieved from the analyzed package, external information already in the knowledge base about the same dependency version would help perform a more accurate assessment.

Furthermore, without storing license information in the knowledge base permanently, it would be impossible to "break the chain of license issues" triggered by a specific dependency version found not compliant in other assessments. We cannot see how to propagate the notification to all the packages using it, because the architecture flushes previous license information, preventing us from knowing which already scanned packages used the same dependency version.

The point we want to make here is that we should keep in mind the need to modify an assessment whenever new license incompatibilities are discovered.

Unknown or misinterpreted licenses recognized by scanning tools

During this evaluation, we met two particular cases that revealed how tricky it could be for scanning tools to recognize a specific license precisely.

These cases are related to two files in two different repositories, detected with Unknown and Public Domain licenses, respectively. We double-checked both cases with the Scancode tool as well, finding that the Unknown case is specific to the QMSTR detection, while the Public Domain case is common to both tools.

The repository TeamSpeak-3-Java-API resulted in an Unknown license because, in a file that declares the MIT license text but without the MIT keyword (removed as suggested by the best practice of the license itself), the keyword unknown occurs twice inside the code (once in a comment and once to return the VirtualServerStatus.UNKNOWN value) [8].

For this specific case, we assumed that QMSTR could not recognize the MIT License text at the beginning of the file and, searching for a license inside it, found the tricky word unknown and reported an Unknown license.

The repository hope-boot was detected by both scanning tools with a Public Domain license, a license category that groups licenses no longer subject to copyright. The license detection should result in the specific license name that belongs to the public domain and should not simply report Public Domain. Other files in the same repository were detected with a specific license name and then further labeled as a Public Domain license. An example is a file [9] licensed under WTFPL 2.0 (SPDX-id: WTFPL), then categorized as a Public Domain license. The file contains the WTFPL key several times and does not contain any public or domain keyword.

Nevertheless, a specific file, AbstractCrudService.java, has been marked only with the Public Domain license key because it contains 6 instances of the string public DOMAIN, which stands for a public method implementing a <DOMAIN> class [10]. This file does not contain any explicit license declaration.

The similarity between these two particular cases lies in the fact that neither contains precise information regarding a license declared for the file in question. It is true that the Unknown case declared an MIT license correctly and that the license itself could be tricky to recognize when part of the license text has been modified. Nevertheless, it is also clear that not providing clear and unambiguous license information at the top of each file can lead to misinterpretation of the code, recognizing licenses that possibly exist but are not declared for that specific file (e.g., the public DOMAIN case).

This kind of specific case also affected the development of the LCV demonstrator. While we found it easy to treat particular cases of outbound licenses as exceptions, e.g., the NOASSERTION nomenclature used by the GitHub APIs when a repository does not include an explicit license declaration, we found it more challenging to treat these cases for the inbound licenses.

We refrained from adding a row and a column to the Compatibility Matrix for this particular outbound case because we can simply prevent the LCV logic from running when an accepted outbound license is missing.

On the other hand, we suffered from the inclusion of incorrect inbound licenses, for two reasons. The first is that, while developing LCV, we trusted the license detection phase, assuming that the scanning tools were retrieving generally accepted license information. The second is the production of logs related to the LCV verbose assessment. Repositories released under a single outbound license may have, within their set of inbound licenses, one or more licenses that are not compatible with the outbound one. While developing, we felt it was our responsibility to provide an accurate assessment, with precise information for each license association. The first approach was to insert all the licenses we found within the scanned repositories into our matrix. We did not consider filtering any of them out and providing a separate output for an invalid inbound license, which now seems the right direction to follow. It is essential to mention that when we started writing the code of the LCV demonstrator, we did not yet have the valuable support of the OSADL matrix.

To handle these two exceptional cases, which the OSADL matrix does not cover, we added two rows and two columns to the matrix, handling an individual output for each association. Initially, looking at a few license dependency graphs [11][12], we relied upon the existence of a license interpreted as Public Domain, but we realized how tricky the output provided by both scanning tools was only after comparing our matrix with the one proposed by OSADL, which was released only very recently.

Looking back, we approached populating the matrix incorrectly by inserting such borderline cases. As a reasonable next step, we should handle this output from well-known scanning tools by throwing an exception, warning users that these kinds of licenses should not appear among the inbound licenses, and further indicating where this detection occurred.

The Compatibility Matrix presented here has, for visualization purposes, been generated after removing the columns related to the Unknown case. However, we kept the Public Domain one, assuming that it is always compatible as an inbound license but not compatible, as an outbound one, with any other more restrictive license.

However, as can be observed from the matrix that we are using in the GitHub repository [13], we treated the Unknown case by filling each association with UNK, providing an output for each LCV iteration that matches an Unknown inbound license.

Next steps

As a next step, we need to consider how to integrate the LCV within the Fasten architecture. It would be essential to understand whether it could run as a docker container that does not necessarily die after each execution.

Another option would be to translate the LCV logic entirely from Python to Java, but this would require an effort that can be avoided if the docker container approach proves viable.

Regarding the license detection phase, we should consider a functional approach applicable to the Fasten architecture. The one adopted for this demonstrator, namely using QMSTR, cannot be integrated.

References


[1] P. Ombredanne, “Free and open source software license compliance: Tools for software composition analysis,” Computer, vol. 53, no. 10, pp. 105–109, 2020.
[2] P. Ombredanne and D. Clark, “What is the state of open source license clarity?” ClearlyDefined, Apr. 2019.
[3] D. Spinellis, “How to select open source components,” Computer, vol. 52, no. 12, pp. 103–106, 2019.
[4] “OSADL open source license checklists obligations project - compatibility matrix.” https://www.osadl.org/fileadmin/checklists/matrix.html
[5] R. Di Cosmo and S. Zacchiroli, “Software Heritage: Why and how to preserve software source code,” in IPRES 2017 - 14th International Conference on Digital Preservation, 2017, pp. 1–10.
[6] A. Pietri, D. Spinellis, and S. Zacchiroli, “The Software Heritage graph dataset: Public software development under one roof,” in 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR), 2019, pp. 138–142.
[7] “ClearlyDefined.” https://clearlydefined.io/
[8] “TeamSpeak-3-java-api/serverqueryinfo.java at 68807a93de23cd3fde1efdf36c29823992fd72ce · theholywaffle/teamspeak-3-java-api.” https://github.com/TheHolyWaffle/TeamSpeak-3-Java-API/blob/68807a93de23cd3fde1efdf36c29823992fd72ce/src/main/java/com/github/theholywaffle/teamspeak3/api/wrapper/ServerQueryInfo.java#L172-L184
[9] “Hope-boot/jquery.nouislider.min.js at master · hope-for/hope-boot.” https://github.com/hope-for/hope-boot/blob/master/hope-admin/src/main/resources/static/js/plugins/nouslider/jquery.nouislider.min.js
[10] “Hope-boot/abstractcrudservice.java at d90cecad037f1d68393d68d400831830f0334ca6 · hope-for/hope-boot.” https://github.com/hope-for/hope-boot/blob/d90cecad037f1d68393d68d400831830f0334ca6/hope-framework/src/main/java/com/hope/jpa/service/impl/AbstractCrudService.java
[11] G. M. Kapitsaki, F. Kramer, and N. D. Tselikas, “Automating the license compatibility process in open source software with SPDX,” Journal of Systems and Software, vol. 131, pp. 386–401, 2017.
[12] “License compatibility - Wikipedia.” https://en.wikipedia.org/wiki/License_compatibility
[13] “LCV-CM/licenses_tests.csv at main · endocode/lcv-cm.” https://github.com/endocode/LCV-CM/blob/main/csv/licenses_tests.csv