Suggesting New License Detection¶
This analysis is then used to suggest new license matches to replace the matches that have license detection issues, which rectifies those issues and gives a better license detection. This helps in quickly identifying issues when looking at a scan result, or in larger use cases such as audits.
This can also help with subsequent rule additions to the scancode rules, to enhance the scancode data.
Note
The generation of rule/.yml files and other follow-up actions is a work in progress. Currently, only reports of suggested rules are generated, containing the rule text, license key and rule type.
Matched Text¶
When two matched texts have a common boundary and share a common substring, as in this example:
start_line - 16
end_line - 17
matched_text:
* You should have received a copy of the GNU Lesser General Public\n * License along with this library; if not, write to the Free Software
start_line - 17
end_line - 19
matched_text:
* License along with this library; if not, write to the Free Software\n * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston,\n * MA 02110-1301 USA
Here, "License along with this library; if not, write to the Free Software" is a substring of both matched_texts, so we cannot simply join them; we have to join them without this repetition.
If they do have a common boundary but do not have common substrings, then they are joined simply.
If they do not have a common boundary, the size of the gap between them decides:
Gap of 4 lines or fewer: they are joined as one rule.
Gap of more than 4 lines: they are made into two separate rules.
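The joining behaviour described above can be sketched in Python. This is only an illustration under stated assumptions, not the analyzer's actual code: the function names, the dictionary layout, and the choice to measure the gap as the second match's start_line minus the first match's end_line are all hypothetical. The repetition is detected line by line, which assumes the repeated lines are identical in both matched texts.

```python
def overlapping_lines(first_lines, second_lines):
    """Count how many trailing lines of the first text repeat as leading lines of the second."""
    for size in range(min(len(first_lines), len(second_lines)), 0, -1):
        if first_lines[-size:] == second_lines[:size]:
            return size
    return 0


def join_matched_texts(first, second, max_gap=4):
    """Return a list of rule texts: one if the matches are joined, two otherwise.

    `first` and `second` are hypothetical dicts with the keys
    start_line, end_line and matched_text; `first` comes earlier in the file.
    """
    # Assumption: the gap is the distance between the end of the first
    # match and the start of the second one.
    gap = second["start_line"] - first["end_line"]
    if gap > max_gap:
        # Too far apart: keep them as two separate rules.
        return [first["matched_text"], second["matched_text"]]

    first_lines = first["matched_text"].split("\n")
    second_lines = second["matched_text"].split("\n")
    # Only matches that share a boundary line can repeat text.
    repeated = overlapping_lines(first_lines, second_lines) if gap <= 0 else 0
    # Drop the repeated lines from the second text, then join.
    return ["\n".join(first_lines + second_lines[repeated:])]


first = {
    "start_line": 16,
    "end_line": 17,
    "matched_text": "* You should have received a copy of the GNU Lesser General Public\n"
                    "* License along with this library; if not, write to the Free Software",
}
second = {
    "start_line": 17,
    "end_line": 19,
    "matched_text": "* License along with this library; if not, write to the Free Software\n"
                    "* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston,\n"
                    "* MA 02110-1301 USA",
}
joined = join_matched_texts(first, second)
print(joined[0])
```

With the two matches from the example, the shared "License along with this library" line appears only once in the joined text.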
Predict License Expression¶
The steps are as follows:
First, all the license expressions in the list are sorted by their number of occurrences.
Generic license_expressions like unknown and warranty-disclaimer are removed from this sorted list.
If there is only one license_expression with the highest number of occurrences, then that is the predicted license_expression.
If multiple license_expressions are tied for the highest number of occurrences, the license_expression of the license match with the highest matched_length is given as the prediction.
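The steps above could look roughly like the following Python sketch. The function name, the dictionary layout of a match, and the set of generic expressions are assumptions made for illustration. The sketch also assumes that the tie-break only considers matches whose expression is among the tied ones, and that None is returned when every expression is generic; the text above does not specify either case.

```python
from collections import Counter

# Assumption: only the two generic expressions named in the text are filtered.
GENERIC_EXPRESSIONS = {"unknown", "warranty-disclaimer"}


def predict_license_expression(matches):
    """Predict one license_expression from a list of license matches.

    Each match is a hypothetical dict with the keys
    license_expression and matched_length.
    """
    candidates = [
        m for m in matches
        if m["license_expression"] not in GENERIC_EXPRESSIONS
    ]
    if not candidates:
        return None

    # Count occurrences of every remaining license expression.
    counts = Counter(m["license_expression"] for m in candidates)
    top_count = max(counts.values())
    tied = {expr for expr, count in counts.items() if count == top_count}

    if len(tied) == 1:
        return next(iter(tied))

    # Tie: fall back to the match with the highest matched_length.
    best = max(
        (m for m in candidates if m["license_expression"] in tied),
        key=lambda m: m["matched_length"],
    )
    return best["license_expression"]


matches = [
    {"license_expression": "unknown", "matched_length": 4},
    {"license_expression": "mit", "matched_length": 12},
    {"license_expression": "apache-2.0", "matched_length": 60},
]
print(predict_license_expression(matches))
```

Here mit and apache-2.0 each occur once, so the longer apache-2.0 match decides the prediction.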
.yml file attributes¶
The boolean value denoting the license type, i.e. license text/notice/tag/reference, is determined from the class of problem into which the matches are already divided.
The ignorable attributes could be added later by using scripts.
The possible license id (like mit) is predicted as the license ID of the match with the highest match_coverage. This has to be manually verified in most cases.
If the rule is a false_positive, as determined from the class of problem, only the is_false_positive attribute is set.
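As a rough illustration of how these attributes might be assembled, here is a minimal Python sketch. The function name, the rule_type labels used as dictionary keys, and the match layout are hypothetical; the mapping from a class of problem to a rule type is not shown because the text above does not define it.

```python
# Assumption: one boolean flag per license type named in the text.
RULE_TYPE_FLAGS = {
    "text": "is_license_text",
    "notice": "is_license_notice",
    "tag": "is_license_tag",
    "reference": "is_license_reference",
}


def build_rule_attributes(matches, rule_type):
    """Build the attributes of a suggested rule as a plain dict.

    `rule_type` is a hypothetical label derived from the class of problem:
    one of the RULE_TYPE_FLAGS keys, or "false_positive".
    """
    if rule_type == "false_positive":
        # False positives carry only this single attribute.
        return {"is_false_positive": True}

    # The license id comes from the match with the highest match_coverage
    # and still needs manual verification.
    best = max(matches, key=lambda m: m["match_coverage"])
    return {
        "license_expression": best["license_expression"],
        RULE_TYPE_FLAGS[rule_type]: True,
    }


attributes = build_rule_attributes(
    [
        {"license_expression": "mit", "match_coverage": 95.0},
        {"license_expression": "bsd-new", "match_coverage": 40.0},
    ],
    "notice",
)
print(attributes)
```

The ignorable attributes are left out on purpose, since they are meant to be added later by scripts.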
Rule Confidence¶
The rule-confidence is calculated so that only the low-confidence errors have to be checked manually.