Age | Commit message | Author |
|
It is supposed to separate the parsing of Debian packages (understanding
how the format works) from the actual feature extraction. Its goal is to
simplify writing custom extractors for different feature sets.
|
|
|
|
Instead of carefully crafting an iterator to pass to yaml.safe_dump_all,
we take control ourselves and call represent on a yaml dumper object
where needed.
|
|
|
|
|
|
Otherwise the yaml will contain binary strings on py3k which end up as
binary data in the sqlite database. In py2, yaml can handle those
unicode objects just fine.
|
|
|
|
|
|
|
|
zlib.crc32 returns an int32_t on py2 and a uint32_t on py3.
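A minimal sketch of one way to get the same value on both interpreters:
mask the result to 32 bits. The helper name crc32_unsigned is
hypothetical and not taken from the repository.

```python
import zlib


def crc32_unsigned(data, value=0):
    """Return the CRC-32 of data as an unsigned 32-bit integer."""
    # On py2 zlib.crc32 may return a negative number. Masking with
    # 0xffffffff maps it onto the unsigned range that py3 already uses.
    return zlib.crc32(data, value) & 0xffffffff
```

The mask is a no-op on py3, so the same code can run unchanged on both.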
|
|
|
|
|
|
|
|
While in current sid packages the control file in control.tar is always
named "./control", some older packages name it "control".
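A possible way to accept both spellings is to strip a leading "./"
before comparing member names. This is only a sketch: read_control is a
hypothetical helper, not the function used in the repository.

```python
import tarfile


def read_control(tar):
    """Return the bytes of the control file, or None if it is absent.

    tar is an open tarfile.TarFile for control.tar.
    """
    for member in tar:
        name = member.name
        # Newer packages use "./control", some older ones use "control".
        if name.startswith("./"):
            name = name[2:]
        if name == "control" and member.isfile():
            return tar.extractfile(member).read()
    return None
```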
|
|
wording, more NOT NULLs, some more explanations
|
|
Thanks to Peter Palfrader for explaining what information is needed and
reviewing the documentation.
|
|
|
|
|
|
Creating packages with lzma compression has been deprecated since dpkg
1.16.4, but there might be some of those in the wild and supporting them
is straightforward when xz is already supported.
Signed-off-by: Guillem Jover <guillem@debian.org>
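As an illustration of why the two formats can share a code path, the
sketch below uses the lzma module from the Python 3 standard library,
whose FORMAT_AUTO setting accepts both .xz and legacy .lzma streams.
Whether the repository uses this module is an assumption, and
decompress_xz_or_lzma is a hypothetical name.

```python
import lzma


def decompress_xz_or_lzma(data):
    """Decompress a complete .xz or legacy .lzma byte string."""
    # FORMAT_AUTO detects the container format from the stream itself.
    decompressor = lzma.LZMADecompressor(format=lzma.FORMAT_AUTO)
    return decompressor.decompress(data)
```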
|
|
dpkg supports those since 1.17.6.
Signed-off-by: Guillem Jover <guillem@debian.org>
|
|
The GNU ar format adds a trailing slash to the member names. Normalize
the member names to take this into account.
Signed-off-by: Guillem Jover <guillem@debian.org>
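The normalization could look roughly like the sketch below. It assumes
the name is the raw, space-padded name field of an ar member header, and
normalize_ar_name is a hypothetical helper. The GNU special members "/"
and "//" are not handled here.

```python
def normalize_ar_name(raw_name):
    """Return an ar member name without padding or GNU trailing slash."""
    # The name field in an ar header is padded with spaces.
    name = raw_name.rstrip(b" ")
    # GNU ar terminates member names with a single slash.
    if name.endswith(b"/"):
        name = name[:-1]
    return name
```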
|
|
|
|
|
|
|
|
|
|
|
|
Reported-By: Stefan Kaltenbrunner
|
|
When comparing two packages, objects would be considered duplicates
without considering whether the respective hash functions are comparable
by checking their equivalence classes. The current set of hash functions
does not expose this bug.
|
|
Hash functions are partitioned into equivalence classes. We are
generally only interested in sharing among hash functions with the same
equivalence class, but the algorithm would compute any sharing. While
the current layout never produces the same hashes for functions in
different equivalence classes (for different output length), that may
change in the future.
Also allow hash functions that belong to no equivalence class at all
(eqclass = NULL) as a means to add additional metadata to content
without computing any sharing for it.
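The effect of the eqclass column can be sketched with an in-memory
sqlite database. The table layout is reduced to the columns needed here,
and the function names and class numbers are illustrative assumptions,
not the repository's actual data. Because NULL = NULL is not true in
SQL, functions without an equivalence class drop out of the join
without any special casing.

```python
import sqlite3


def make_demo_db():
    """Create a tiny function table with hypothetical rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE function ("
        "id INTEGER PRIMARY KEY, name TEXT NOT NULL, eqclass INTEGER)")
    conn.executemany(
        "INSERT INTO function (name, eqclass) VALUES (?, ?)",
        [("sha512", 1), ("gzip_sha512", 1),
         ("png_sha512", 2), ("gif_sha512", 2),
         ("file_size", None)])  # metadata only, no sharing
    return conn


def comparable_pairs(conn):
    """Return (name1, name2) pairs of functions in the same class."""
    query = (
        "SELECT f1.name, f2.name FROM function AS f1 "
        "JOIN function AS f2 ON f1.eqclass = f2.eqclass "
        "ORDER BY f1.name, f2.name")
    return conn.execute(query).fetchall()
```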
|
|
Otherwise the gzip hash cannot tell the empty stream and the
compressed empty stream apart.
|
|
|
|
Otherwise all files smaller than 10 bytes are successfully hashed to the
hash of the empty input when using the GzipDecompressor.
Reported-By: Olly Betts
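The failure mode can be reproduced with plain zlib: a streaming
decompressor that has not yet seen a full gzip header yields no output
and no error, which looks the same as an empty payload. The sketch below
is not the repository's GzipDecompressor. It only shows one way to tell
the cases apart, by checking that the end of the stream was reached.

```python
import zlib


def gunzip_strict(data):
    """Decompress a gzip byte string, rejecting truncated input."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    output = decompressor.decompress(data)
    # Without this check, b"" and other short inputs would silently
    # decompress to b"" and hash like the empty stream.
    if not decompressor.eof:
        raise ValueError("truncated or incomplete gzip stream")
    return output
```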
|
|
|
|
On the main instance, opening a cursor amounts to initiating a
connection. Unfortunately sqlite3.Connection.close does not close file
descriptors. So just open fewer cursors to leak file descriptors less
often.
|
|
Leaking them can result in running out of available file descriptors.
|
|
|
|
|
|
|
|
|
|
* In practice there are very few compressed files with trivial hashes.
* Blacklisting these values results in false positives in the gzip
issues.
|
|
|
|
|
|
webapp has had a relation hash_functions that modeled "comparable
functions". Images should not be compared to other files, since it makes
no sense to store them as the RGBA stream that is being hashed. This
comparability property resembles an equivalence relation. So the
function table gains a column eqclass. Each class is represented by a
number and functions are statically assigned to these classes. Now the
filtering happens in SQL instead of Python.
|
|
* Rename "image_sha512" to "png_sha512".
* dedup.image.ImageHash is now a base class for image hashes such as
PNGHash and GIFHash.
* Enable both hashes in importpkg.
* Fix README.
* Add new hash combinations to webapp.
* Add "gif file not named *.gif" to issues in update_sharing.
* Add redirect for "image_sha512" to webapp for backwards
compatibility.
|
|
|
|
|
|
|
|
|
|
|
|
They cluttered webapp.py and now vim can give proper highlighting for
the templates.
|