Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add package dependency completness field #3402

Open
wants to merge 4 commits into
base: main
Choose a base branch
from
Open

Conversation

wagoodman
Copy link
Contributor

@wagoodman wagoodman commented Oct 31, 2024

Description

Adds a field that describes the completeness of a package's direct dependencies:

  • complete: the package has all of its direct dependencies resolved and related to this package.
  • incomplete: indicates that the package is known to not have all of its direct dependencies listed. This is reserved for cases where we know there are a non-zero number of dependencies for a package, but we are not listing them intentionally or because we are unable to resolve them.
  • complete-with-indirect: a superset of complete -- indicates that the package has all of its direct dependencies resolved as well as some or all of indirect dependencies. What is notable about this is that direct and indirect dependencies are linked directly to this package and are not separable (you cannot distinguish between a direct and indirect dependency from the perspective of this package).
  • unknown: indicates that the completeness of the dependencies cannot be considered positively complete or incomplete. This should be used when the dependency resolution mechanism is not well understood, the set of dependencies is unknowable, or no attempt has been made to resolve dependencies (no assertion).

In addition to adding this new field, all catalogers were updated to raise up accurate values for this field:

Cataloger Completeness Comment
alpm-db-cataloger complete
apk-db-cataloger complete
binary-classifier-cataloger unknown classifiers are limited to identifying package identities, but have no information about dependencies (though there may be dependencies via shared libs and other mechanisms [static lib at compile time])
cargo-auditable-binary-cataloger unknown
cocoapods-cataloger unknown
conan-cataloger conanfile.txt: unknown, conan.lock: complete
conan-info-cataloger complete
dart-pubspec-lock-cataloger unknown
dotnet-deps-cataloger complete
dotnet-portable-executable-cataloger unknown
dpkg-db-cataloger complete
elf-binary-package-cataloger unknown though we can look for shared libs, we cannot see static dependencies nor dynamic dependencies using dlopen. This means that, even in cases where the dep info is actually complete, we can't programmatically determine that.
elixir-mix-lock-cataloger unknown
erlang-otp-application-cataloger unknown
erlang-rebar-lock-cataloger unknown
github-action-workflow-usage-cataloger complete the only known exception to this is shared workflows
github-actions-usage-cataloger unknown
go-module-binary-cataloger complete-with-indirect / unknown the main module gets the complete-with-indirect but all dependencies get unknown
go-module-file-cataloger unknown there is no main module discovered to make relationships for
graalvm-native-image-cataloger unknown or if anything is decoded from the internal SBOM, then that value is used
haskell-cataloger unknown
java-archive-cataloger complete / unknown all packages are assumed to be unknown unless searching for transitive dependencies configuration option has been enabled
java-gradle-lockfile-cataloger unknown
java-jvm-cataloger unknown
java-pom-cataloger unknown
javascript-lock-cataloger unknown
javascript-package-cataloger unknown we find all dependency nodes but are not crafting relationships
linux-kernel-cataloger unknown
lua-rock-cataloger unknown
nix-store-cataloger unknown
opam-cataloger unknown
php-composer-installed-cataloger unknown
php-composer-lock-cataloger unknown
php-pecl-serialized-cataloger unknown
portage-cataloger unknown
python-installed-package-cataloger complete there is a WithResolvingProcessors(wheelEggRelationships) on the cataloger
python-package-cataloger requirements.txt: unknown, setup.py: unknown, poetry.lock: complete, Pipfile.lock: unknown
r-package-cataloger unknown
rpm-archive-cataloger unknown
rpm-db-cataloger complete
ruby-gemfile-cataloger unknown
ruby-gemspec-cataloger unknown
ruby-installed-gemspec-cataloger unknown
rust-cargo-lock-cataloger complete
sbom-cataloger unknown
swift-package-manager-cataloger unknown
swipl-pack-cataloger unknown
wordpress-plugins-cataloger unknown

Type of change

  • New feature (non-breaking change which adds functionality)

Checklist:

  • I have added unit tests that cover changed behavior
  • I have tested my code in common scenarios and confirmed there are no regressions
  • I have added comments to my code, particularly in hard-to-understand sections

@github-actions github-actions bot added the json-schema Changes the json schema label Oct 31, 2024
@wagoodman wagoodman force-pushed the note-dep-quality branch 2 times, most recently from bb1b51b to 141d19a Compare October 31, 2024 16:27
@wagoodman wagoodman marked this pull request as ready for review October 31, 2024 16:28
@wagoodman wagoodman requested a review from a team November 15, 2024 14:44
@willmurphyscode
Copy link
Contributor

@wagoodman I wanted to make sure I understand why the rust-cargo-lock-cataloger is labeled as "complete" not as "mixed".

It seems like as we make a set of dependency relationships more complete, it passes from complete, to mixed, and back to complete, which is surprising:

  1. It finds all the direct dependencies and nothing else -> "complete"
  2. It finds all the direct and some or all indirect dependencies, but you can't tell how direct the dependencies are -> "mixed"
  3. It finds all direct and indirect dependencies, and you can tell which ones are direct -> "complete"

It seems weird to me that a cataloger can effectively get promoted from complete to mixed, and then from mixed to complete, by adding relationship data.

When I first read the descriptions above, I thought, "rust isn't complete, it's mixed, because it includes transitive dependencies."

Am I understanding correctly here?

@wagoodman
Copy link
Contributor Author

wagoodman commented Nov 15, 2024

Here each node in the cargo.lock is describing only direct dependencies so this is complete

@willmurphyscode
Copy link
Contributor

Cross posting #3010 (comment) so that it gets seen here.

Language pkg.Language `json:"language"`
CPEs cpes `json:"cpes"`
PURL string `json:"purl"`
Dependencies pkg.DependencyCompleteness `json:"dependencies"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the right name for this field? licenses is the actual licenses, cpes is the actual CPEs, locations is the actual locations, but dependencies is the completeness and directness of the dependencies. What about dependencyCompleteness or something? I hate to make it that much longer, but I also hate to have dependencies be the only plural noun on here that's no a slice of noun.

CPEs: cpes,
PURL: p.PURL,
Metadata: p.Metadata,
Name: p.Name,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was expecting to see syft/format/... for other formats. Do SPDX and CDX have anywhere to put dependency completeness?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For SPDX only in 3.x (which we don't support yet), for Cyclone there is support for it in the format spec for versions we support -- I'll update the PR to map appropriately.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did this change get missed? I don't see a diff in the cyclonedx and cyclonedx json formatters, and I expected to. (I think it's also ok to do this as a follow up PR, since the current PR is large.)

PURL: portableExecutablePackageURL(name, version),
Metadata: metadata,
PURL: portableExecutablePackageURL(name, ver),
// by nature PE metadata does not have any dependency information, thus we are forced to claim incomplete
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to clarify the difference between unknown and incomplete one more time:

  • Dependency is known to be incomplete or known to be missing entirely -> Incomplete
  • It is not knowable whether dependency data is complete -> unknown.

Is "PE file does not include dependency information, so this package's dependencies are known to be incomplete." a good summary of the comment here then?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

basic definitions:

incomplete: the package does not have all of its direct dependencies resolved.
unknown: indicates when dependency resolution mechanism for this package is not well understood.

I've struggled through most of development on a good firm definition for unknown vs incomplete.

Is "PE file does not include dependency information, so this package's dependencies are known to be incomplete." a good summary of the comment here then?

It seems like one, but I'm stuck on "dependencies are known to be incomplete", which isn't quite true.

Do you think we should extend unknown to really mean "no assertion" (for any reason), in which case may things listed as incomplete in the interim would switch to unknown.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've adjusted the definitions based off of our offline discussions:

incomplete: indicates that the package is known to not have all of its direct dependencies listed.

unknown: indicates that the completeness of the dependencies cannot be considered positively complete or incomplete. This should be used when the dependency resolution mechanism is not well understood, the set of dependencies is unknowable, or no attempt has been made to resolve dependencies (no assertion).

That also means that I don't think we have a case yet where we would list 'incomplete'.

relationships = append(relationships, artifact.Relationship{ //nolint:gocritic // we intentionally want to use the reference to the package which will still be mutated and finalized to a value later
// both the to and from packages may be mutated based on the resolved dependencies, so we need references to these values
// to ensure the nodes used are consistent with the final state of the packages.
// note: it is VITAL that these references are replaced with values by the caller of this function before using
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These kind of comments always make me nervous. Is there something we can do to make sure this happens? What if we return something that represents instructions for building an []artifact.Relationships in the caller, so that the caller can't forget and return a []artifact.Relationships with unexpected types in it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes that's what the finalizeRelationships call at the final return of the cataloger (a chokepoint)

return finalizePackages(pkgs), finalizeRelationships(relationships), errs
}
func finalizeRelationships(relationships []artifact.Relationship) []artifact.Relationship {
for i := range relationships {
if f, ok := relationships[i].From.(*pkg.Package); ok {
relationships[i].From = *f
}
if t, ok := relationships[i].To.(*pkg.Package); ok {
relationships[i].To = *t
}
}
return relationships
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that. I think I'd rather the type system remember that finalizeRelationships needs to be called. For example, you could have a local struct unfinalizedRelationship and return a slice of those, and then finalize relationship can take a slice of unfinalizedRelationship and return a slice of a artifact.Relationship. That way if someone tries to use this function without finalizing the relationships, the compiler won't let them.

@@ -46,6 +47,8 @@ func newPackageFromAudit(dep *rustaudit.Package, locations ...file.Location) pkg
Language: pkg.Rust,
Type: pkg.RustPkg,
Locations: file.NewLocationSet(locations...),
// no attempt is made by the parser function to resolve dependencies
Dependencies: pkg.IncompleteDependencies,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rust auditable binary cataloger is a lot like the Go binary cataloger. I think this cataloger can be changed to emit a "main" package and CompleteWithIndirect. Or do we not trust the completeness of this metadata if it's present?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm trying not to change the cataloger logic in this PR -- what you're describing would be a good follow up enhancement for this cataloger

Type Type `cyclonedx:"type"` // the package type (e.g. Npm, Yarn, Python, Rpm, Deb, etc)
CPEs []cpe.CPE `hash:"ignore"` // all possible Common Platform Enumerators (note: this is NOT included in the definition of the ID since all fields on a CPE are derived from other fields)
PURL string `hash:"ignore"` // the Package URL (see https://github.com/package-url/purl-spec)
Dependencies DependencyCompleteness `hash:"ignore"` // the completeness of the dependency information for this package
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I said this above, but I don't think Dependencies is the right name for this field. Licenses is of type LicenseSet, which is a set of licenses. CPEs is a slice of cpe.CPE. Locations is a LocationSet. Dependencies is an enum about how complete the dependencies are. It seems to break an established pattern.

Locations: file.NewLocationSet(file.NewVirtualLocation("/hello-auditable", "/hello-auditable")),
Language: pkg.Rust,
Type: pkg.RustPkg,
Dependencies: pkg.UnknownDependencyCompleteness,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to change now that #3500 is merged?

@@ -46,6 +47,8 @@ func newPackageFromAudit(dep *rustaudit.Package, locations ...file.Location) pkg
Language: pkg.Rust,
Type: pkg.RustPkg,
Locations: file.NewLocationSet(locations...),
// no attempt is made by the parser function to resolve dependencies
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the attempt is made now that #3500 is merged.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll handle this on the rebase

@@ -9,17 +9,18 @@ import (
"github.com/anchore/syft/syft/pkg/cataloger/internal/pkgtest"
)

func TestParsePackPackage(t *testing.T) {
func xTestParsePackPackage(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is committing this x intentional?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great catch, that's a typo

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
json-schema Changes the json schema
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Supply "depth" information when including relationships
2 participants