path: root/importpkg.py
Age | Commit message | Author
2016-05-24 | use urlopen from urllib2 on py2 | Helmut Grohne
This causes non-successful fetches to result in HTTPErrors like it does in py3 already.
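The import pattern this commit describes can be sketched as follows. This is a minimal illustration assuming the common try/except fallback between the Python 3 and Python 2 module names; the `fetch` helper is hypothetical and is not taken from importpkg.py.

```python
try:
    # Python 3: urlopen raises HTTPError for non-successful responses.
    from urllib.request import urlopen
    from urllib.error import HTTPError
except ImportError:
    # Python 2: urllib2.urlopen raises HTTPError as well.
    from urllib2 import urlopen, HTTPError


def fetch(url):
    """Hypothetical helper: return the response body, or None on an HTTP error."""
    try:
        response = urlopen(url)
    except HTTPError:
        return None
    try:
        return response.read()
    finally:
        response.close()
```

With a single `urlopen` name that behaves the same on both interpreters, the caller needs only one error-handling path.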
2016-05-23 | move dedup.debpkg.process_control back into importpkg | Helmut Grohne
After all, it isn't that generic. It knows what information is necessary for running dedup. Thus it really belongs to the extractor subclass. By building on handle_control_info, not that much parsing logic is left in the extractor subclass.
2016-05-23 | importpkg: fix --hash broken in previous commit | Helmut Grohne
2016-05-23 | remove curl dependency | Helmut Grohne
Teach importpkg how to download urls using urlopen and thus remove the need for invoking curl.
2016-05-21 | move from deprecated optparse to argparse | Helmut Grohne
2016-05-01 | push more functionality into DebExtractor | Helmut Grohne
The handle_ar_member and handle_ar_end methods now have a default implementation adding further handlers handle_debversion, handle_control_tar and handle_data_tar. In that process two additional bugs were fixed:
 * decompress_tar was wrongly passing errors="surrogateescape" for Python 2.x even though that's only supported for Python 3.x.
 * The use of decompress actually passes the extension as unicode.
2016-04-28 | support Python 3.x in importpkg | Helmut Grohne
In Python 2.x, TarInfo.name is a bytes object. In Python 3.x, TarInfo.name always is a unicode object. To avoid importpkg crashing with an exception, we direct the Python 3.x decoding to use surrogateescapes. Thus decoding the name boils down to checking whether it contains surrogates.
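The surrogate check mentioned above can be sketched like this for Python 3. The function names `decode_name` and `contains_surrogates` are hypothetical, and the snippet decodes raw bytes directly instead of going through tarfile, so it only illustrates the idea and is not the code from this commit.

```python
def decode_name(raw):
    """Decode a raw member name; undecodable bytes become lone surrogates."""
    return raw.decode("utf-8", "surrogateescape")


def contains_surrogates(name):
    """Return True if name holds surrogates, i.e. the raw bytes were not valid UTF-8."""
    try:
        # Strict UTF-8 encoding refuses lone surrogate code points.
        name.encode("utf-8")
    except UnicodeEncodeError:
        return True
    return False
```

A name for which `contains_surrogates` returns True could then be skipped instead of aborting the whole import.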
2016-04-28 | decouple a function decompress out of decompress_tar | Helmut Grohne
Building on the previous commit, add a decompress function that turns a compressed filelike into a decompressed filelike. Use it to decouple the decompression step.
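A function of that shape might look like the sketch below. It assumes Python 3 and relies on the standard library wrappers (gzip, bz2, lzma) rather than on the decompression classes in dedup, so the body is an illustration and not the project's implementation; only the name `decompress` and the idea of dispatching on the extension come from the log.

```python
import bz2
import gzip
import lzma


def decompress(filelike, extension):
    """Wrap a compressed binary file-like object in a decompressing one.

    extension is the suffix following ".tar", for example ".gz" or "".
    """
    if extension == "":
        return filelike  # uncompressed, nothing to do
    if extension == ".gz":
        return gzip.GzipFile(fileobj=filelike)
    if extension == ".bz2":
        return bz2.BZ2File(filelike)
    if extension in (".xz", ".lzma"):
        # LZMAFile auto-detects the .xz and legacy .lzma containers when reading.
        return lzma.LZMAFile(filelike)
    raise ValueError("unknown compression format with extension %r" % extension)
```

The returned object can be handed to `tarfile.open(fileobj=..., mode="r|")`, which keeps decompression and tar parsing as separate steps.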
2016-04-21 | importpkg: move the hash function list to the extractor class | Helmut Grohne
They really are an aspect of the particular extractor and can easily be changed by subclassing.
2016-04-19 | add a class DebExtractor for guiding feature extraction | Helmut Grohne
It is supposed to separate the parsing of Debian packages (understanding how the format works) from the actual feature extraction. Its goal is to simplify writing custom extractors for different feature sets.
2016-04-16 | add a validate method to HashedStream | Helmut Grohne
2016-04-16 | importpkg: use yaml dumper directly | Helmut Grohne
Instead of carefully crafting an iterator to pass to yaml.safe_dump_all, we simply take control on our own and call represent on a yaml dumper object where needed.
2016-04-16 | importpkg: refactor commit handling out of process_package* | Helmut Grohne
2015-04-16 | use binary stdin on py3k | Helmut Grohne
2015-04-16 | distinguish bytes from unicode for py3k | Helmut Grohne
2014-07-23 | importpkg: be more liberal in control file naming | Helmut Grohne
While in current sid packages the control file in control.tar is always named "./control", some older packages name it "control".
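Accepting both spellings can be sketched as follows. The names `CONTROL_NAMES`, `read_control` and `make_control_tar` are hypothetical; the second helper exists only to build an in-memory archive for demonstration and has no counterpart in importpkg.py.

```python
import io
import tarfile

# Both spellings occur in practice: "./control" in current packages,
# plain "control" in some older ones.
CONTROL_NAMES = ("./control", "control")


def read_control(tar):
    """Return the bytes of the control file in an open TarFile, or None."""
    for member in tar:
        if member.name in CONTROL_NAMES:
            return tar.extractfile(member).read()
    return None


def make_control_tar(member_name, content):
    """Hypothetical helper: build an in-memory tar holding a single member."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        info = tarfile.TarInfo(member_name)
        info.size = len(content)
        tar.addfile(info, io.BytesIO(content))
    buf.seek(0)
    return tarfile.open(fileobj=buf, mode="r")
```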
2014-05-11 | importpkg: reduce copy&paste | Helmut Grohne
2014-05-11 | importpkg: add support for data.tar.lzma | Guillem Jover
Creating packages with lzma compression has been deprecated since dpkg 1.16.4, but there might be some of those in the wild and supporting them is straightforward when xz is already supported.
Signed-off-by: Guillem Jover <guillem@debian.org>
2014-05-11 | importpkg: add support for control.tar and control.tar.xz | Guillem Jover
dpkg supports those since 1.17.6.
Signed-off-by: Guillem Jover <guillem@debian.org>
2014-02-23 | spell check comments | Helmut Grohne
2014-02-19 | blacklist content rather than hashes | Helmut Grohne
Otherwise the gzip hash cannot tell the empty stream and the compressed empty stream apart.
2013-09-02 | importpkg: move library-like parts to dedup.debpkg | Helmut Grohne
2013-08-19 | importpkg: don't blacklist boring gzip_sha512 hashes | Helmut Grohne
 * In practice there are very few compressed files with trivial hashes.
 * Blacklisting these values results in false positives in the gzip issues.
2013-08-01 | support hashing gif images | Helmut Grohne
 * Rename "image_sha512" to "png_sha512".
 * dedup.image.ImageHash is now a base class for image hashes such as PNGHash and GIFHash.
 * Enable both hashes in importpkg.
 * Fix README.
 * Add new hash combinations to webapp.
 * Add "gif file not named *.gif" to issues in update_sharing.
 * Add redirect for "image_sha512" to webapp for backwards compatibility.
2013-07-29 | importpkg.py: support uncompressed data.tar | Helmut Grohne
2013-07-26 | verify package hashes when importing via http | Helmut Grohne
2013-07-12 | importpkg: simplify state logic | Helmut Grohne
2013-07-12 | importpkg: split process_package to process_control | Helmut Grohne
2013-06-10 | split the import phase to a yaml stream | Helmut Grohne
importpkg.py now emits a yaml stream instead of updating the database. The actual updating now happens in readyaml.py. In this process autoimport.py was significantly reworked to import packages in parallel.
2013-03-26 | Merge branch schemachange | Helmut Grohne
2013-03-12 | move ArReader from importpkg to dedup.arreader | Helmut Grohne
Also document it.
2013-03-09 | split content table to a hash table | Helmut Grohne
In the old content table (package, filename, size) would be the same for multiple hash functions. Now the schema represents that each file has precisely one size, but multiple hashes.
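The shape of such a split can be illustrated with an in-memory SQLite database. All table and column names below are assumptions chosen for the example and are not copied from the project's schema file; the point is only that each file row carries one size while any number of hash rows refer to it.

```python
import sqlite3

# Hypothetical schema: one row per file, plus one row per (file, hash function).
SCHEMA = """
CREATE TABLE content (
    id INTEGER PRIMARY KEY,
    package TEXT NOT NULL,
    filename TEXT NOT NULL,
    size INTEGER NOT NULL
);
CREATE TABLE hash (
    cid INTEGER NOT NULL REFERENCES content(id),
    function TEXT NOT NULL,
    hash TEXT NOT NULL
);
"""


def make_db():
    """Create an in-memory database with the example schema."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    return db


def add_file(db, package, filename, size, hashes):
    """Store one file once and attach a row for every hash function."""
    cur = db.execute(
        "INSERT INTO content (package, filename, size) VALUES (?, ?, ?)",
        (package, filename, size),
    )
    cid = cur.lastrowid
    db.executemany(
        "INSERT INTO hash (cid, function, hash) VALUES (?, ?, ?)",
        [(cid, function, value) for function, value in sorted(hashes.items())],
    )
    return cid
```

Adding a further hash function later then means inserting more `hash` rows, without repeating the package, filename and size.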
2013-03-07 | enable enforcing foreign keys | Helmut Grohne
2013-03-07 | integrate the source table into the package table | Helmut Grohne
2013-03-05 | importpkg: source header may contain a version | Helmut Grohne
2013-03-04 | importpkg: record the source package relationship | Helmut Grohne
2013-03-02 | move sql schema to a separate file | Helmut Grohne
2013-02-24 | hash image contents | Helmut Grohne
2013-02-23 | importpkg: ignore filenames with encoding errors | Helmut Grohne
2013-02-21 | move compression functions to module dedup.compression | Helmut Grohne
2013-02-21 | move hashing functions to module dedup.hashing | Helmut Grohne
2013-02-21 | rename test.py to importpkg.py | Helmut Grohne