License Detection Issue Types
There are 4 types of license information as segregated by scancode-toolkit, based on rule types.
text
notices
tags
references
Note that this is the order of precedence: if a file-region contains one match of a type higher in this order and several matches of types lower in the order, the issue in the file-region is most likely of the higher type. For example, if one of the matches is to a notice rule and the other matches are to reference rules, then the file-region in question is likely a License Notice.
These types carry license information of differing importance, so license scan errors fall into fundamentally different kinds of problems, which may require different approaches to solve.
As a preliminary approach, issues are segregated by the type of the matched rule that has the largest matched_length value. The type could also be determined by NLP Sentence Classifiers (BERT), fine-tuned on the Scancode Rules.
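The two selection strategies described above (order of precedence, and largest matched_length) can be sketched as follows. This is only a minimal illustration, not the analyzer's actual code: it assumes each match is a flat dict carrying the four boolean rule flags and a matched_length value, and the function names are hypothetical.

```python
# Order of precedence from the list above: text > notice > tag > reference.
PRECEDENCE = ["text", "notice", "tag", "reference"]

# Maps the boolean flags of a matched rule to the four license types.
FLAG_TO_TYPE = {
    "is_license_text": "text",
    "is_license_notice": "notice",
    "is_license_tag": "tag",
    "is_license_reference": "reference",
}


def rule_type(match):
    """Return the license type of one match, or None if no flag is set."""
    for flag, license_type in FLAG_TO_TYPE.items():
        if match.get(flag):
            return license_type
    return None


def type_by_precedence(matches):
    """Pick the highest-precedence type present among the matches."""
    present = {rule_type(match) for match in matches}
    for license_type in PRECEDENCE:
        if license_type in present:
            return license_type
    return None


def type_by_matched_length(matches):
    """Pick the type of the match with the largest matched_length."""
    typed = [match for match in matches if rule_type(match) is not None]
    if not typed:
        return None
    longest = max(typed, key=lambda match: match.get("matched_length", 0))
    return rule_type(longest)


# One notice match among several reference matches in a file-region.
matches = [
    {"is_license_reference": True, "matched_length": 4},
    {"is_license_notice": True, "matched_length": 12},
    {"is_license_reference": True, "matched_length": 3},
]
print(type_by_precedence(matches))      # notice
print(type_by_matched_length(matches))  # notice
```

In this example both strategies agree; they can differ when a lower-precedence match happens to be the longest one in the file-region.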
Result Attributes
In the results of the analysis, the attribute holding this information is issue_type.
It has several sub-attributes:
classification_id and classification_description
4 boolean fields: is_license_text, is_license_notice, is_license_tag, and is_license_reference
is_suggested_matched_text_complete and analysis_confidence
Here, the classification_id attribute is an id which corresponds to an issue type from All Issue Types, into which the license detection issue is classified. The classification_description describes the issue_type to provide more information and context about the analysis.
There are 4 main types of issues in issue_type, and these correspond to the 4 boolean flags in scancode rules:
is_license_text - License Texts
is_license_notice - License Notices
is_license_tag - License Tags
is_license_reference - License References
The analysis_confidence is an approximate measure of how accurate the classification into these issue_types is (and not a measure of whether it is an issue or not). It has 3 values: high, medium and low.
In many cases, and mostly in cases of a new license text, there are significant differences between already seen licenses and the new license. As a consequence, the matched fragments, even if stitched together, do not contain the whole text. The is_suggested_matched_text_complete attribute records this.
Note
Currently, only issues with is_license_text as True have their is_suggested_matched_text_complete value as False.
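To make the shape of issue_type concrete, here is a hedged sketch of it as a Python dataclass. The attribute names come from the list above; the class name, the default values, and the validation rules are assumptions made for illustration and are not taken from the analyzer's code.

```python
from dataclasses import dataclass, asdict

# The three values listed above for analysis_confidence.
CONFIDENCE_VALUES = ("high", "medium", "low")


@dataclass
class IssueType:
    """Hypothetical container mirroring the issue_type attributes."""

    classification_id: str
    classification_description: str
    is_license_text: bool = False
    is_license_notice: bool = False
    is_license_tag: bool = False
    is_license_reference: bool = False
    # Assumed defaults: complete text and high confidence.
    is_suggested_matched_text_complete: bool = True
    analysis_confidence: str = "high"

    def __post_init__(self):
        if self.analysis_confidence not in CONFIDENCE_VALUES:
            raise ValueError(
                f"unknown analysis_confidence: {self.analysis_confidence!r}"
            )
        # Per the note above, only license text issues can be incomplete.
        if not self.is_suggested_matched_text_complete and not self.is_license_text:
            raise ValueError("only license text issues can be incomplete")


example = IssueType(
    classification_id="text-lic-text-fragments",
    classification_description="Only parts of a larger license text are detected.",
    is_license_text=True,
    is_suggested_matched_text_complete=False,
    analysis_confidence="medium",  # illustrative value, not a real result
)
print(asdict(example)["classification_id"])  # text-lic-text-fragments
```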
All Issue Types
License type | classification_id | Description |
---|---|---|
text | text-legal-lic-files | The matched text is present in a file whose name is a known legal filename. |
text | text-non-legal-lic-files | The matched license text is present in a file whose name is not a known legal filename. |
text | text-lic-text-fragments | Only parts of a larger license text are detected. |
notice | notice-and-or-with-notice | A notice with a complex license expression (i.e. exceptions, choices or combinations). |
notice | notice-single-key-notice | A notice with a single license. |
notice | | License notices with unknown licenses detected. |
notice | | A piece of code/text is incorrectly detected as a license. |
tag | | A part of a license tag is detected. |
tag | | A new/common structure of tags is detected, with scope for being handled differently. |
tag | | A piece of code/text is incorrectly detected as a license. |
reference | reference-lead-in-or-unknown-refs | Lead-ins to known license references are detected. |
reference | reference-low-coverage-refs | License references with an incomplete match. |
reference | reference-to-local-file | Matched to an unknown rule as the license information is present in another file, which is referred to in this matched piece of text. |
reference | | A piece of code/text is incorrectly detected as a license. |
 | | A piece of common introduction to a license text/notice/reference is detected. |
License Texts
All the issue_types with is_license_text as True.
License Text Files
Note
Value of issue_type:classification_id :- text-legal-lic-files
[More Than 90% License Words/Legal File]
Here the “is_license_text” plugin is used to detect whether it is a license file or not; “is_legal” can also be used for the detection, so the result is an OR operation between these two cases.
So, if the full text is present in “matched_text”, we can go ahead and craft the rule from the matched_text.
License Texts in Files
Note
Value of issue_type:classification_id :- text-non-legal-lic-files
[with less than 90% License Words]
In some cases, one of the “is_license_text” and “is_legal” tags, or even both, could be False, and the match could still be classified as a License Text because:
the rule it was partially matched to was a license text rule
the license-type sentence classifier designated it as a license text
Note: In this case, how “is_license_text” and “is_legal” are calculated could be updated, based on common mistakes.
Full text doesn’t exist in matched_text
Note
Value of issue_type:classification_id :- text-lic-text-fragments
Where the full text doesn’t exist in matched_text and we have to fetch the source file which was scanned.
This is a common occurrence in new unique license texts, which aren’t fully present. Normally these are detected by the 3-seq matcher stage.
By scanning license texts present in scancode after reindexing the license index to the state before that particular text was added, we can see how the scan results look when entirely new license texts are encountered.
It seems that when the license text is large and varies a lot from already existing license texts, the entire text doesn’t exist inside “matched_text”, so we have to go to the source file which was scanned and add it from there.
For example, these are the results for the “cern-ohl-w-2.0.LICENSE” file, scanned by taking scancode to a state where it wasn’t added.
The scan result file has multiple partial matches:
“ it applies as licensed under CERN-OHL-S or CERN-OHL-W”
“ licensed under CERN-OHL-S or CERN-OHL-W as appropriate.”
“ licensed under a licence approved by the Free Software”
“ interfaced, which remain licensed under their own applicable”
“ direct, indirect, special, incidental, consequential, exemplary,\n punitive or other damages of any character including, without\n limitation, procurement of substitute goods or services, loss of\n use, data or profits, or business interruption, however caused\n and on any theory of contract, warranty, tort (including\n negligence), product liability or otherwise, arising in any way\n in relation to the Covered Source, modified Covered Source\n and/or the Making or Conveyance of a Product, even if advised of\n the possibility of such damages, and You shall hold the”
“ 7.1 Subject to the terms and conditions of this Licence, each”
“ You may treat Covered Source licensed under CERN-OHL-W as”
“ licensed under CERN-OHL-S if and only if all Available”
Clearly the actual license has a lot more text, which we can only get by going to the source.
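One way to check whether the matched fragments cover a file-region, and to fall back to the source file when they do not, is sketched below. It assumes each match is a dict with start_line and end_line values (1-indexed, inclusive); the function names and the idea of a line-based check are illustrative assumptions. Line coverage is only a necessary condition, since a matched line may itself be only partially matched.

```python
def covered_lines(matches):
    """Collect every source line number touched by at least one match."""
    lines = set()
    for match in matches:
        lines.update(range(match["start_line"], match["end_line"] + 1))
    return lines


def is_region_fully_matched(matches, region_start, region_end):
    """True if every line of the file-region is covered by some match."""
    region = set(range(region_start, region_end + 1))
    return region <= covered_lines(matches)


def read_region(path, region_start, region_end):
    """Fetch the text of the file-region directly from the scanned file."""
    with open(path, encoding="utf-8") as source:
        lines = source.read().splitlines()
    return "\n".join(lines[region_start - 1:region_end])


# Two fragments that leave lines 3 and 4 of a five-line region unmatched.
partial = [
    {"start_line": 1, "end_line": 2},
    {"start_line": 5, "end_line": 5},
]
print(is_region_fully_matched(partial, 1, 5))  # False
```

When the check fails, read_region can be used to pull the whole region from the scanned file instead of stitching matched_text fragments together.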
License Notices
All issue_types with their is_license_notice value as True.
Exceptions, Rules with Keys having AND/OR
Note
Value of issue_type:classification_id :- notice-and-or-with-notice
Where there are multiple “notice” license detections, not of the same license name, in a single file. These are often:
dual licenses
exceptions
These have multiple license detections, and sometimes new combinations are detected and have to be added to the Rules.
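The distinction between a single-license notice and a notice with several different licenses could be sketched like this. It is a simplified illustration only: each match is assumed to be a dict with a license key and an is_license_notice flag, the function name is hypothetical, and the other notice issue types are ignored.

```python
def classify_notice_matches(matches):
    """Classify a file-region by how many distinct notice licenses it has."""
    keys = {match["key"] for match in matches if match.get("is_license_notice")}
    if len(keys) > 1:
        # Several different licenses: dual licenses, exceptions, and so on.
        return "notice-and-or-with-notice"
    if len(keys) == 1:
        return "notice-single-key-notice"
    return None


dual = [
    {"key": "mit", "is_license_notice": True},
    {"key": "gpl-2.0", "is_license_notice": True},
]
print(classify_notice_matches(dual))  # notice-and-or-with-notice
```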
Single key notices
Note
Value of issue_type:classification_id :- notice-single-key-notice
This is the general case of License Notice issues: if an issue is a license notice case and doesn’t fall into the other license notice cases, then it belongs in this category.
These are often detected because License Notices are often unique to projects. Rules for these can be crafted with fairly high confidence, as almost always the entire text is present in “matched_text”.
License References
All the issue_types with is_license_reference as True.
Those with low match coverages
Note
Value of issue_type:classification_id :- reference-low-coverage-refs
This is the most common type of license detection error, as there exist a lot of license references, and they can be added. These are also highly fixable problems, as almost always the whole license reference is captured in matched_text.
We should separate these location-wise, and add them as new rules without any manual oversight.
This is the general case of License Reference issues: if an issue is a license reference case and doesn’t fall into the other license reference cases detailed below, then it belongs in this category.
Unknown file license references
Note
Value of issue_type:classification_id :- reference-to-local-file
In many cases the license that is referred to is in another file, and only the filename is given, not the license name. Example: “see license in file LICENSE.txt”
In these cases, if there is more context or specific wording, add these as new unknown rules.
So we separate these based on their matched_rules, i.e. whether they are matched to an “unknown” or a similar kind of non-explicitly named rule.
Otherwise discard them, as this is an issue to be handled separately, by implementing a system in scancode where these links are followed and their license added.
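The separation step could look roughly like the sketch below. It assumes each match is a dict with a license_expression string, and that "unknown" appearing in that expression marks a non-explicitly named rule; both the data shape and the function name are assumptions, not the analyzer's actual logic.

```python
def split_local_file_references(matches):
    """Separate matches to "unknown" rules from matches to named licenses."""
    unknown, named = [], []
    for match in matches:
        if "unknown" in match["license_expression"]:
            unknown.append(match)  # candidates for new unknown rules
        else:
            named.append(match)    # to be discarded and handled separately
    return unknown, named


matches = [
    {"license_expression": "unknown-license-reference",
     "matched_text": "see license in file LICENSE.txt"},
    {"license_expression": "mit", "matched_text": "MIT"},
]
unknown, named = split_local_file_references(matches)
print(len(unknown), len(named))  # 1 1
```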
Introduction to a License Notice
Note
Value of issue_type:classification_id :- reference-lead-in-or-unknown-refs
There are cases where the RULE name begins with lead-in_unknown_, i.e. these are known lead-ins to licenses, so even if the exact license isn’t detected, it can be reported that there is a license reference here.
Here we could add the license reference to the Scancode Rules or, as in the example case below, craft a new rule by joining the two existing ones.
Example case:
“Dual licensed under” is lead-in_unknown_30.RULE
say there is another rule: “MIT and GPL”
and the text we scan is: “Dual licensed under MIT and GPL”
To note: if they appear quite frequently, it is okay to craft a new rule, because we cannot just add all combinations of lead-ins and license names.
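The example case can be sketched in a few lines of Python. The rule-name prefix comes from the text above; the helper names and the min_count threshold used to decide what counts as "quite frequently" are hypothetical choices for illustration.

```python
from collections import Counter

LEAD_IN_PREFIX = "lead-in_unknown_"


def is_lead_in_rule(rule_identifier):
    """True if the rule name marks a known lead-in to a license."""
    return rule_identifier.startswith(LEAD_IN_PREFIX)


def join_rule_texts(lead_in_text, license_text):
    """Build the text of a candidate new rule from two existing rule texts."""
    return f"{lead_in_text.strip()} {license_text.strip()}"


def frequent_combinations(observed_texts, min_count=3):
    """Keep only combined texts seen at least min_count times (assumed cutoff)."""
    counts = Counter(text.strip() for text in observed_texts)
    return [text for text, count in counts.items() if count >= min_count]


print(is_lead_in_rule("lead-in_unknown_30.RULE"))             # True
print(join_rule_texts("Dual licensed under", "MIT and GPL"))  # Dual licensed under MIT and GPL
```

Filtering by frequency before crafting a rule keeps the rule set from growing with every possible pairing of lead-in and license name.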