Age | Commit message | Author |
|
Teach importpkg how to download urls using urlopen and thus remove the
need for invoking curl.
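
A minimal sketch of the idea, not the actual importpkg code: the helper name open_package and the spooling threshold are assumptions made for illustration. It fetches a URL with urlopen and hands back a seekable file object, so no curl subprocess is needed.

```python
import shutil
import tempfile
from urllib.request import urlopen


def open_package(url):
    """Fetch url and return a seekable binary file holding its contents.

    Hypothetical helper: importpkg's real interface may differ.
    """
    # Keep small downloads in memory, spill larger ones to disk (1 MiB is
    # an arbitrary threshold chosen for this sketch).
    spool = tempfile.SpooledTemporaryFile(max_size=1 << 20)
    with urlopen(url) as response:
        shutil.copyfileobj(response, spool)
    spool.seek(0)
    return spool
```

urlopen also accepts file:// URLs, so local packages can go through the same code path.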
|
|
The handle_ar_member and handle_ar_end methods now have default
implementations, which dispatch to the new handlers handle_debversion,
handle_control_tar and handle_data_tar.
In that process two additional bugs were fixed:
* decompress_tar was wrongly passing errors="surrogateescape" for
Python 2.x even though that's only supported for Python 3.x.
* The use of decompress actually passes the extension as unicode.
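
A rough sketch of what such a default dispatch could look like. Only the method names come from the message above; the class name, the signatures and the matching on ar member names are assumptions.

```python
class DebExtractor:
    """Hypothetical base class: parses the container, delegates the rest."""

    def handle_ar_member(self, name, filelike):
        # Default implementation: route each ar member to a dedicated
        # handler based on its name. Unknown members are ignored.
        if name == "debian-binary":
            self.handle_debversion(filelike.read())
        elif name.startswith("control.tar"):
            self.handle_control_tar(name, filelike)
        elif name.startswith("data.tar"):
            self.handle_data_tar(name, filelike)

    def handle_ar_end(self):
        # Called after the last ar member; nothing to do by default.
        pass

    def handle_debversion(self, version):
        pass

    def handle_control_tar(self, name, filelike):
        pass

    def handle_data_tar(self, name, filelike):
        pass
```

A custom extractor would then override only the handlers it cares about.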
|
|
In Python 2.x, TarInfo.name is a bytes object. In Python 3.x,
TarInfo.name is always a unicode object. To avoid importpkg crashing
with an exception, we direct the Python 3.x decoding to use the
surrogateescape error handler. Checking whether a name decoded cleanly
thus boils down to checking whether it contains surrogates.
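
To illustrate the surrogate check on Python 3.x, here is a small sketch that decodes a raw name by hand. The function name decode_member_name and the choice of UTF-8 are assumptions; importpkg itself lets tarfile do the decoding.

```python
def decode_member_name(raw):
    """Return (text, is_clean) for a tar member name given as bytes.

    Hypothetical helper for illustration only.
    """
    # Undecodable bytes become lone surrogates instead of raising.
    text = raw.decode("utf-8", errors="surrogateescape")
    try:
        # A strict encode fails exactly when lone surrogates are present.
        text.encode("utf-8")
    except UnicodeEncodeError:
        return text, False
    return text, True
```

Encoding the text again with errors="surrogateescape" restores the original bytes, so no information is lost for badly encoded names.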
|
|
Building on the previous commit, add a decompress function that turns a
compressed filelike into a decompressed filelike. Use it to decouple the
decompression step.
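
A sketch of such a function using only the standard library. The signature and the set of supported extensions are assumptions and need not match the function added by this commit.

```python
import bz2
import gzip
import lzma


def decompress(filelike, extension):
    """Wrap a compressed binary filelike in a decompressing filelike.

    extension is the suffix after "control.tar" or "data.tar", for
    example ".gz" or "" (hypothetical convention for this sketch).
    """
    if extension == "":
        return filelike  # not compressed, pass through
    if extension == ".gz":
        return gzip.GzipFile(fileobj=filelike)
    if extension == ".bz2":
        return bz2.BZ2File(filelike)
    if extension in (".xz", ".lzma"):
        # LZMAFile auto-detects the .xz and legacy .lzma containers
        # when reading.
        return lzma.LZMAFile(filelike)
    raise ValueError("unknown compression format %r" % (extension,))
```

The result can be passed to tarfile.open(fileobj=..., mode="r|"), which keeps tar parsing independent of the compression format.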
|
|
They really are an aspect of the particular extractor and can easily be
changed by subclassing.
|
|
It is supposed to separate the parsing of Debian packages (understanding
how the format works) from the actual feature extraction. Its goal is to
simplify writing custom extractors for different feature sets.
|
|
Instead of carefully crafting an iterator to pass to yaml.safe_dump_all,
we take control ourselves and call represent on a yaml dumper object
where needed.
|
|
While in current sid packages the control file in control.tar is always
named "./control", some older packages name it "control".
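
A minimal sketch of accepting both spellings when scanning control.tar. The helper name read_control and the constant CONTROL_NAMES are made up for this example.

```python
import tarfile

# Both spellings seen in the wild: current packages and some older ones.
CONTROL_NAMES = ("./control", "control")


def read_control(tar):
    """Return the raw bytes of the control file from an open TarFile."""
    for member in tar:
        if member.isfile() and member.name in CONTROL_NAMES:
            return tar.extractfile(member).read()
    raise KeyError("no control file found")
```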
|
|
Creating packages with lzma compression has been deprecated since dpkg
1.16.4, but there might be some of those in the wild and supporting them
is straightforward when xz is already supported.
Signed-off-by: Guillem Jover <guillem@debian.org>
|
|
dpkg supports those since 1.17.6.
Signed-off-by: Guillem Jover <guillem@debian.org>
|
|
Otherwise the gzip hash cannot tell the empty stream and the
compressed empty stream apart.
|
|
* In practice there are very few compressed files with trivial hashes.
* Blacklisting these values results in false positives in the gzip
issues.
|
|
* Rename "image_sha512" to "png_sha512".
* dedup.image.ImageHash is now a base class for image hashes such as
PNGHash and GIFHash.
* Enable both hashes in importpkg.
* Fix README.
* Add new hash combinations to webapp.
* Add "gif file not named *.gif" to issues in update_sharing.
* Add redirect for "image_sha512" to webapp for backwards
compatibility.
|
|
importpkg.py now emits a yaml stream instead of updating the database.
The actual updating now happens in readyaml.py. In this process,
autoimport.py was significantly reworked to import packages in parallel.
|
|
Also document it.
|
|
In the old content table, (package, filename, size) was repeated for
each hash function. Now the schema represents that each file has
precisely one size, but multiple hashes.
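
One way to picture the split, as a sketch with sqlite3. The table and column names below are assumptions and are not copied from the project's schema.

```python
import sqlite3

# Hypothetical schema: one row per file, any number of hash rows per file.
SCHEMA = """
CREATE TABLE content (
    id INTEGER PRIMARY KEY,
    package TEXT NOT NULL,
    filename TEXT NOT NULL,
    size INTEGER NOT NULL
);
CREATE TABLE hash (
    cid INTEGER NOT NULL REFERENCES content(id),
    function TEXT NOT NULL,
    hash TEXT NOT NULL
);
"""


def store_file(conn, package, filename, size, hashes):
    """Insert one file and its hashes (a dict: function name -> hex digest)."""
    cur = conn.execute(
        "INSERT INTO content (package, filename, size) VALUES (?, ?, ?)",
        (package, filename, size),
    )
    conn.executemany(
        "INSERT INTO hash (cid, function, hash) VALUES (?, ?, ?)",
        [(cur.lastrowid, function, value) for function, value in hashes.items()],
    )
```

The size is stored once per file, and adding a further hash function only adds rows to the second table.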
|
|