License Detection Issue Types¶
There are 4 types of license information as segregated by scancode-toolkit, based on rule types.
text
notices
tags
references.
Note that this is the order of precedence, as in in a file-region if there’s one type higher in this order and other several types lower in the order, the issue in the file-region is likely in the former type. For example if one of the matches is matched to a notice rule and other matches to reference rules, then the file-region in question is likely a License Notice.
These carry different importance of the license information present, and thus in case of license scan errors, there are fundamentally different types of problems which may require different approaches to solve them.
These are primarily segregated by the type of the rule matched, having the largest
matched_length
value, as a preliminary approach. It could also be determined by
NLP Sentence Classifiers (BERT), fine-tuned on the Scancode Rules.
Result Attributes¶
In the results of the analysis, the attributes having this information is issue_type
.
This further has many attributes:
classification_id
andclassification_description
4 boolean fields
is_license_text
,is_license_notice
,is_license_tag
, andis_license_reference
.
is_suggested_matched_text_complete
andanalysis_confidence
Here, the classification_id
attribute is a id which corresponds to a issue type from
all possible issue types, which the license detection issue is
classified into. The classification_description
describes the issue_type to provide
more information and context about the analysis.
There are 4 main types of issues in issue_type and these correspond to the 4 boolean flags in scancode rules:
is_license_text
- License Textsis_license_notice
- License Noticesis_license_tag
- License Tagsis_license_reference
- License References
Now the analysis_confidence
is an approximate measure of how accurate the classification into
these issue_types are (and not a measure of whether it is an issue or not). It has 3 values:
high, 2. medium and 3. low
In many cases, and mostly in cases of a new license text, there are significant differences
between already seen licenses and this new license. So as a consequence, all the matched fragments
if stitched together, doesn’t contain the whole text. This is_suggested_matched_text_complete
attribute has this information.
Note
Now only issues with is_license_text as True has it’s is_suggested_matched_text_complete value as false.
All Issue Types¶
|
|
|
---|---|---|
|
|
The matched text is present in a file whose name is a known legal filename. |
|
|
The matched license text is present in a file whose name is not a known legal filename. |
|
|
Only parts of a larger license text are detected. |
|
|
A notice with a complex license expression (i.e. exceptions, choices or combinations). |
|
|
A notice with a single license. |
|
|
License notices with unknown licenses detected. |
|
|
A piece of code/text is incorrectly detected as a license. |
|
|
A part of a license tag is detected |
|
|
A new/common structure of tags are detected with scope for being handled differently. |
|
|
A piece of code/text is incorrectly detected as a license. |
|
|
Lead-ins to known license references are detected. |
|
|
License references with a incomplete match. |
|
|
Matched to an unknown rule as the license information is present in another file, which is referred to in this matched piece of text. |
|
|
A piece of code/text is incorrectly detected as a license. |
|
|
A piece of common introduction to a license text/notice/reference is detected. |
License Texts¶
All the issue_types with is_license_text as True.
License Text Files¶
Note
Value of issue_type:classification_id
:- text-legal-lic-files
[More Than 90% License Words/Legal File]
Here the “is_license_text” plugin is used to detected if it’s a License File or Not, also “is_legal” can be used for the detection, so an OR operation between these two cases.
So, if the full text is there in the “matched_text” we can go ahead and craft the rule from the
matched_text
.
License Texts in Files¶
Note
Value of issue_type:classification_id
:- text-non-legal-lic-files
[with less than 90% License Words]
In some cases, one of the “is_license_text” and “is_legal” tags, or even both could be False, and it still could be classified as a License Text because
the Rule it was partially matched was a license text rule
the
license-type
sentence classifier designated it as a license text
Note: In this case how “is_license_text” and “is_legal” is calculated could be updated, based on common mistakes.
Full text doesn’t exist in matched_text¶
Note
Value of issue_type:classification_id
:- text-lic-text-fragments
Where the Full text doesn’t exist in matched_text and we have to go to/fetch the source file which was scanned.
This is a common occurance in new unique license texts, which aren’t fully present. Normally these
are detected by the 3-seq
matcher stage.
On scanning License Texts Present in scancode, by reindexing the license index to the state before that particular text was added, we can see how the scan results look when entirely new license texts are encountered.
So it seems as the license text is large, and varies a lot from already existing license texts, the entire text doesn’t exist inside “matched_text”, so we have to go to the source file which was scanned and add it from there.
For example these are the results for the “cern-ohl-w-2.0.LICENSE” file scanned by taking scancode to a state where it wasn’t added.
Scan Result File has multiple partial matches
” it applies as licensed under CERN-OHL-S or CERN-OHL-W”
” licensed under CERN-OHL-S or CERN-OHL-W as appropriate.”
” licensed under a licence approved by the Free Software”
” interfaced, which remain licensed under their own applicable”
” direct, indirect, special, incidental, consequential, exemplary,n punitive or other damages of any character including, withoutn limitation, procurement of substitute goods or services, loss ofn use, data or profits, or business interruption, however causedn and on any theory of contract, warranty, tort (includingn negligence), product liability or otherwise, arising in any wayn in relation to the Covered Source, modified Covered Sourcen and/or the Making or Conveyance of a Product, even if advised ofn the possibility of such damages, and You shall hold the”
” 7.1 Subject to the terms and conditions of this Licence, each”
” You may treat Covered Source licensed under CERN-OHL-W as”
” licensed under CERN-OHL-S if and only if all Available”
Clearly the actual license has a lot more text, which we can only get by going to the source.
License Notices¶
All issue_types with their is_license_notice value as True.
Exceptions, Rules with Keys having AND/OR¶
Note
Value of issue_type:classification_id
:- notice-and-or-with-notice
Where there are multiple “notice” license detections, not of the same license name, in a single file. These are often:
dual licenses
exceptions
These have multiple license detections and some times new combinations are detected, and has to be added to the Rules.
Single key notices¶
Note
Value of issue_type:classification_id
:- notice-single-key-notice
This is the general case of License Notice cases, so if it’s a license notice case and doesn’t fall into the other license notice cases detailed below, then it belongs in this category.
These are often detected as License Notices are often unique in projects, and for these rules can be crafted with fairly high confidence as almost always the entire text is present in “matched_text”.
License Tags¶
All issue_types with their is_license_tag value as True.
Wrong License Tag Detections¶
Note
Value of issue_type:classification_id
:- tag-tag-coverage
Among all “is_license_tag” = True cases, if match_coverage is less than 100, then it is a wrong license detection, and as tags are small and matched_text almost always contains the whole tag, a Rule can be created from these class of Problems.
This is the general case of License Tag cases, so if it’s a license tag case and doesn’t fall into the other license tag cases detailed below, then it belongs in this category.
Other common Structures of Tags¶
Note
Value of issue_type:classification_id
:- tag-other-tag-structures
There exists specific Tags, for group of projects, and these are mostly found in source code files, in the code itself.
Like for example:
<small>License: <a href="http://www.fsf.org/licensing/licenses/fdl.html">GNU Free Documentation License (FDL)</a></small>
Or MODULE_LICENSE
present in linux kernel source code.
We can cluster the data according to occurrences of same types of structures, and attributes used to cluster/separate could be:
Programming Language
Type of Files?
Related Issue - https://github.com/nexB/scancode-toolkit/issues/707
Finding False Positives from License Tags Detections¶
Note
Value of issue_type:classification_id
:- tag-false-positives
.
There also exists notice-false-positives
and reference-false-positives
, similarly.
In these cases, value of issue_id
:- false-positives
Now, the “is_license_tag” is mostly true for these, but the “match_coverage” is always 100 in these cases. These are almost always wrongly detected by some handful of rules which has only the words gpl/lgpl or similar ones. So we further narrow our search down to only 1-3 word rules and and an additional criteria being if the license match occurs at line number more than a certain value, say 1000 or more.
But this also includes a lot of correct detections, which are correctly detected.
This classifying of “False Positives” from “Correct Tag Detection” is solely based on the matched text, and should be solved by a BERT based sentence classifier. The binary classification would be between false-positives and license-tags.
The data needed to train that model, which we can get from two places:-
The already existing scancode license rules, has a lot of examples of False Positives and Correct License Tags
More training data
We could make use of the classifier confidence scores to only look at ambigous cases only.
Note
In some cases some more lines above and below are needed to be added to these false_positive
rules, as the matched_text
can be too general for a false positive rule. This could require
manual work.
License References¶
All the issue_types with is_license_reference as True.
Those with low match coverages¶
Note
Value of issue_type:classification_id
:- reference-low-coverage-refs
This is the most common type of license detection errors, as there exist a lot of
license references, and they can be added. These are also highly fixable problems, as almost always
the whole license reference is captured in matched_text
We should separate these location wise, and add as new rules without any manual oversight.
This is the general case of License Reference cases, so if it’s a license reference case and doesn’t fall into the other license reference cases detailed below, then it belongs in this category.
unknown file license references¶
Note
Value of issue_type:classification_id
:- reference-to-local-file
In many cases the license that is referred to is in another file, and only the filename is given, and not the license name. Example - “see license in file LICENSE.txt”
In these cases if there are more context/specific wording add these as new unknown rules.
So we separate these based on their matched_rules, i.e. if these are matched to an “unknown” or similar kinds of non-explicitly named rules.
Other wise discard, as this is a issue to be handled separately, by implementing a system in scancode where these links are followed and their license added.
Introduction to a License Notice¶
Note
Value of issue_type:classification_id
:- reference-lead-in-or-unknown-refs
There are cases where the RULE name begins with lead-in_unknown_
, i.e. these are known lead-ins
to licenses, so even if the exact license isn’t detected, it can be reported that there is a
license reference here.
Here we could add to the Scancode Rules, the license reference, or as in the example case below, craft a new rule by joining the two existing ones
Example case:-
Dual licensed under
is lead-in_unknown_30.RULE
say there is another rule: MIT and GPL
and the text we scan is : Dual licensed under MIT and GPL
To Note: If they appear quite frequently, it is okay to craft a new rule. Because we cannot just add all combinations of lead-ins and license names.