Age Commit message Author
2016-05-01 use same Python version for autoimport and importpkg (Helmut Grohne)
The autoimport tool runs the Python interpreter explicitly. Instead of invoking just "python" and thus calling whatever the current default is, use sys.executable which is the interpreter used to run autoimport, thus locking both to the same Python version.
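The idea in this commit can be sketched as follows. This is a minimal illustration, not the actual autoimport code: the helper name run_with_same_interpreter is hypothetical, and the child here runs a one-line script rather than importpkg.

```python
import subprocess
import sys


def run_with_same_interpreter(script_args):
    """Run a Python child process with the interpreter running this process."""
    # sys.executable is the path of the running interpreter, so the child
    # cannot silently pick up a different default "python".
    cmd = [sys.executable] + list(script_args)
    return subprocess.check_output(cmd)


# The child reports its major version; it matches the parent's.
child_version = run_with_same_interpreter(
    ["-c", "import sys; print(sys.version_info[0])"]
)
```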
2016-04-28 support Python 3.x in importpkg (Helmut Grohne)
In Python 2.x, TarInfo.name is a bytes object. In Python 3.x, TarInfo.name is always a unicode object. To avoid importpkg crashing with an exception, we direct the Python 3.x decoding to use the surrogateescape error handler. Thus decoding the name boils down to checking whether it contains surrogates.
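A small Python 3 sketch of the surrogate check described here. The function name is_clean_name is hypothetical and the sample file name is made up; importpkg's real code may differ.

```python
def is_clean_name(name):
    """Return True if a surrogateescape-decoded name was valid UTF-8."""
    try:
        # A strict encode fails on the lone surrogates that the
        # surrogateescape handler inserts for undecodable bytes.
        name.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


raw = b"caf\xe9.txt"  # Latin-1 bytes, not valid UTF-8
escaped = raw.decode("utf-8", "surrogateescape")

# The undecodable byte survives as a surrogate and round-trips losslessly.
round_trip = escaped.encode("utf-8", "surrogateescape")
```

Because the original bytes can be recovered, a name that fails the check can still be reported or skipped without losing information.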
2016-04-28 decouple a function decompress out of decompress_tar (Helmut Grohne)
Building on the previous commit, add a decompress function that turns a compressed filelike into a decompressed filelike. Use it to decouple the decompression step.
2016-04-28 extend functionality of DecompressedStream (Helmut Grohne)
It now supports:
* tell()
* seek(absolute_position), forward only
* close()
* closed
This is sufficient for putting it as a fileobj into tarfile.TarFile. By doing so we can decouple decompression from tar processing, which eases papering over the Python 2.x vs Python 3.x differences.
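A stream with this interface might look roughly like the sketch below. It is an illustration under assumptions, not the project's DecompressedStream: the class name, block size and the use of zlib.decompressobj are all chosen here for the example.

```python
import io
import zlib


class ForwardOnlyDecompressedStream(object):
    """File-like view of decompressed data that only seeks forward."""

    blocksize = 65536

    def __init__(self, fileobj, decompressor):
        self.fileobj = fileobj
        self.decompressor = decompressor
        self.buff = b""
        self.pos = 0
        self.eof = False
        self.closed = False

    def _fill(self, length):
        # Pull compressed blocks until enough output is buffered or input ends.
        while len(self.buff) < length and not self.eof:
            data = self.fileobj.read(self.blocksize)
            if data:
                self.buff += self.decompressor.decompress(data)
            else:
                self.buff += self.decompressor.flush()
                self.eof = True

    def read(self, length):
        self._fill(length)
        ret, self.buff = self.buff[:length], self.buff[length:]
        self.pos += len(ret)
        return ret

    def tell(self):
        return self.pos

    def seek(self, pos):
        # Absolute positions only; going forward means reading and discarding.
        if pos < self.pos:
            raise ValueError("backward seek not supported")
        while self.pos < pos:
            if not self.read(min(pos - self.pos, self.blocksize)):
                break  # reached end of data before the target position

    def close(self):
        self.closed = True
        self.fileobj.close()


payload = b"0123456789" * 1000
stream = ForwardOnlyDecompressedStream(
    io.BytesIO(zlib.compress(payload)), zlib.decompressobj()
)
stream.seek(9995)
tail = stream.read(10)  # only 5 bytes remain after the seek
```

Forward-only seeking is enough when a consumer walks the archive from start to end, which is why no buffering of already consumed data is needed.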
2016-04-21 importpkg: move the hash function list to the extractor class (Helmut Grohne)
They really are an aspect of the particular extractor and can easily be changed by subclassing.
2016-04-19 add a class DebExtractor for guiding feature extraction (Helmut Grohne)
It is supposed to separate the parsing of Debian packages (understanding how the format works) from the actual feature extraction. Its goal is to simplify writing custom extractors for different feature sets.
2016-04-16 add a validate method to HashedStream (Helmut Grohne)
2016-04-16 importpkg: use yaml dumper directly (Helmut Grohne)
Instead of carefully crafting an iterator to pass to yaml.safe_dump_all, we simply take control on our own and call represent on a yaml dumper object where needed.
2016-04-16 importpkg: refactor commit handling out of process_package* (Helmut Grohne)
2016-04-08 urlopen moved from urllib to urllib.request in py3k (Helmut Grohne)
2015-04-16 process_control: do not encode to ascii (Helmut Grohne)
Otherwise the yaml will contain binary strings on py3k which end up as binary data in the sqlite database. In py2, yaml can handle those unicode objects just fine.
2015-04-16 tempfile.mkdtemp does not like bytes in py3k (Helmut Grohne)
2015-04-16 unquote moved from urllib to urllib.parse in py3k (Helmut Grohne)
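One common way to cope with such a move is a guarded import, sketched below. Whether the project uses exactly this pattern is an assumption, and the sample file name is made up.

```python
try:
    from urllib.parse import unquote  # Python 3.x location
except ImportError:
    from urllib import unquote  # Python 2.x location

# Percent-escapes in a package file name are decoded the same way on both.
decoded = unquote("libfoo%2B%2B_1.0%7Erc1_amd64.deb")
```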
2015-04-16 element access on bytes yields int in py3k (Helmut Grohne)
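The difference can be avoided with version-neutral idioms such as the ones sketched here. Both helper names are hypothetical and not taken from the project.

```python
def first_byte(data):
    """Return the first byte as a bytes object on Python 2 and 3."""
    # data[0] is a str of length 1 on Python 2 but an int on Python 3;
    # slicing yields bytes on both.
    return data[:1]


def byte_values(data):
    """Return the integer value of every byte on Python 2 and 3."""
    # Iterating a bytearray yields ints on both major versions.
    return list(bytearray(data))
```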
2015-04-16 zlib.crc32 behaves inconsistently on py2 vs py3 (Helmut Grohne)
zlib.crc32 returns an int32_t on py2 and a uint32_t on py3.
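The usual way to get the same value on both versions is to mask the result to 32 bits, as in this sketch. The helper name crc32_unsigned is hypothetical; the commit itself may handle the difference in another way.

```python
import zlib


def crc32_unsigned(data, value=0):
    """Return the CRC-32 as an unsigned 32-bit integer on Python 2 and 3."""
    # Masking maps a negative Python 2 result onto the unsigned value
    # that Python 3 returns directly.
    return zlib.crc32(data, value) & 0xFFFFFFFF


# 0xCBF43926 is the standard CRC-32 check value for this input; its top
# bit is set, so an unmasked Python 2 result would be negative.
check = crc32_unsigned(b"123456789")
```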
2015-04-16 there is no itertools.imap in py3k (Helmut Grohne)
2015-04-16 use binary stdin on py3k (Helmut Grohne)
2015-04-16 distinguish bytes from unicode for py3k (Helmut Grohne)
2014-07-23 importpkg: be more liberal in control file naming (Helmut Grohne)
While in current sid packages the control file in control.tar is always named "./control", some older packages name it "control".
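A minimal sketch of accepting both spellings. The function name is_control_member is hypothetical, and importpkg may implement the check differently.

```python
def is_control_member(name):
    """Return True for the control file, with or without a leading "./"."""
    if name.startswith("./"):
        name = name[2:]
    return name == "control"
```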
2014-06-14 improve schema documentation (Helmut Grohne)
wording, more NOT NULLs, some more explanations
2014-06-14 add documentation to schema.sql (Helmut Grohne)
Thanks to Peter Palfrader for explaining what information is needed and reviewing the documentation.
2014-05-11 update copyright information (Helmut Grohne)
2014-05-11 importpkg: reduce copy&paste (Helmut Grohne)
2014-05-11 importpkg: add support for data.tar.lzma (Guillem Jover)
Creating packages with lzma compression has been deprecated since dpkg 1.16.4, but there might be some of those in the wild and supporting them is straightforward when xz is already supported.
Signed-off-by: Guillem Jover <guillem@debian.org>
2014-05-11 importpkg: add support for control.tar and control.tar.xz (Guillem Jover)
dpkg supports those since 1.17.6.
Signed-off-by: Guillem Jover <guillem@debian.org>
2014-05-11 dedup.arreader: remove trailing slash from ar members (Guillem Jover)
The GNU ar format adds a trailing slash to the member names. Normalize the member names to take this into account.
Signed-off-by: Guillem Jover <guillem@debian.org>
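The normalization could be sketched like this. The function name is hypothetical and this is not the dedup.arreader code; it also ignores the special GNU members named "/" and "//", which a real reader would need to treat separately.

```python
def normalize_ar_member_name(raw_name):
    """Strip the padding and the GNU-style trailing slash from an ar name."""
    # The name field in an ar member header is 16 bytes, padded with spaces.
    name = raw_name.rstrip(b" ")
    if name.endswith(b"/"):
        name = name[:-1]
    return name
```

With this, a member written as "control.tar.gz/" and one written as "control.tar.gz" compare equal.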
2014-05-11 webapp: allow git-like hash truncation (Helmut Grohne)
2014-04-21 autoimport: support protocols besides http (Helmut Grohne)
2014-03-08 schema: make syntax compatible with postgres (Helmut Grohne)
2014-02-23 Merge branch updatesharing-eqclass (Helmut Grohne)
2014-02-23 spell check comments (Helmut Grohne)
2014-02-23 fix spelling mistake (Helmut Grohne)
Reported-By: Stefan Kaltenbrunner
2014-02-23 webapp: fix eqclass usage in package comparison (Helmut Grohne)
When comparing two packages, objects would be considered duplicates without considering whether the respective hash functions are comparable by checking their equivalence classes. The current set of hash functions does not expose this bug.
2014-02-21 update_sharing: weaken assumptions about db layout (Helmut Grohne)
Hash functions are partitioned into equivalence classes. We are generally only interested in sharing among hash functions with the same equivalence class, but the algorithm would compute any sharing. While the current layout never produces the same hashes for functions in different equivalence classes (because their output lengths differ), that may change in the future. Also allow hash functions that belong to no equivalence class at all (eqclass = NULL) as a means to add additional metadata to content without computing any sharing for it.
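The effect of restricting sharing to one equivalence class can be illustrated with a toy database. The schema, the "sha512" and "metadata" function names and all rows below are assumptions for illustration; they are not the project's schema.sql or the update_sharing algorithm.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE function "
    "(id INTEGER PRIMARY KEY, name TEXT NOT NULL, eqclass INTEGER)"
)
db.execute(
    "CREATE TABLE hash (content_id INTEGER NOT NULL, "
    "function_id INTEGER NOT NULL, hash TEXT NOT NULL)"
)
db.executemany(
    "INSERT INTO function VALUES (?, ?, ?)",
    [(1, "sha512", 1), (2, "gzip_sha512", 1),
     (3, "png_sha512", 2), (4, "metadata", None)],
)
# Every row carries the same hash value on purpose.
db.executemany(
    "INSERT INTO hash VALUES (?, ?, ?)",
    [(10, 1, "aaa"), (11, 2, "aaa"), (12, 3, "aaa"),
     (13, 4, "aaa"), (14, 4, "aaa")],
)

# Equal hashes only count as sharing when both functions are in the same
# class. NULL = NULL is not true in SQL, so functions without a class
# never take part.
shared = db.execute(
    "SELECT a.content_id, b.content_id "
    "FROM hash AS a "
    "JOIN hash AS b ON a.hash = b.hash AND a.content_id < b.content_id "
    "JOIN function AS fa ON fa.id = a.function_id "
    "JOIN function AS fb ON fb.id = b.function_id "
    "WHERE fa.eqclass = fb.eqclass "
    "ORDER BY a.content_id, b.content_id"
).fetchall()
```

Only contents 10 and 11 are reported as sharing, even though all five rows have identical hash values.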
2014-02-19 blacklist content rather than hashes (Helmut Grohne)
Otherwise the gzip hash cannot tell the empty stream and the compressed empty stream apart.
2014-02-19 GzipDecompressor: don't treat checksum as garbage trailer (Helmut Grohne)
2014-02-19 DecompressedHash should fail on trailing input (Helmut Grohne)
Otherwise all files smaller than 10 bytes are successfully hashed to the hash of the empty input when using the GzipDecompressor.
Reported-By: Olly Betts
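The failure mode can be illustrated with a strict decompression helper. This sketch swaps in zlib's built-in gzip mode (Python 3.3 or later) instead of the project's GzipDecompressor, and the function name gunzip_strict is hypothetical. Unlike real gzip tools it also rejects concatenated gzip members.

```python
import gzip
import zlib


def gunzip_strict(data):
    """Decompress one gzip stream, rejecting truncated or trailing input."""
    decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)  # expect a gzip wrapper
    out = decomp.decompress(data)
    out += decomp.flush()
    if not decomp.eof:
        # Input ended before the stream did, e.g. inside the 10 byte header.
        raise ValueError("truncated gzip stream")
    if decomp.unused_data:
        raise ValueError("trailing input after gzip stream")
    return out


compressed = gzip.compress(b"payload")
restored = gunzip_strict(compressed)
```

Without the two checks, a short prefix of a gzip header would quietly yield empty output, which matches the symptom described in the commit message.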
2013-10-03 work around python-debian's #670679 (Helmut Grohne)
2013-09-11 webapp: open cursors less often (Helmut Grohne)
On the main instance opening cursors equals initiating a connection. Unfortunately sqlite3.Connection.close does not close file descriptors. So just open fewer cursors to leak file descriptors less often.
2013-09-10 webapp: close database cursors (Helmut Grohne)
Leaking them can result in running out of available file descriptors.
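One way to make sure a cursor is always closed is contextlib.closing, sketched below with the standard sqlite3 module. The helper name fetch_one is hypothetical, and the webapp's actual database layer may differ.

```python
import contextlib
import sqlite3


def fetch_one(conn, query, params=()):
    """Run a query and return its first row, always closing the cursor."""
    with contextlib.closing(conn.cursor()) as cur:
        cur.execute(query, params)
        return cur.fetchone()


conn = sqlite3.connect(":memory:")
row = fetch_one(conn, "SELECT 1 + ?", (1,))
```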
2013-09-04 webapp: serve static files from /static (Helmut Grohne)
2013-09-02 add option -d --database for db path to all scripts (Helmut Grohne)
2013-09-02 autoimport: avoid hard coded temporary directory (Helmut Grohne)
2013-09-02 importpkg: move library-like parts to dedup.debpkg (Helmut Grohne)
2013-08-19 importpkg: don't blacklist boring gzip_sha512 hashes (Helmut Grohne)
* In practice there are very few compressed files with trivial hashes.
* Blacklisting these values results in false positives in the gzip issues.
2013-08-16 make debian version_compare available in sql (Helmut Grohne)
2013-08-16 webapp templates: add an anchor for file issues (Helmut Grohne)
2013-08-02 model comparability as an equivalence relation (Helmut Grohne)
webapp has had a relation hash_functions that modeled "comparable functions". Images should not be compared to other files, since it makes no sense to store them as the RGBA stream that is being hashed. This comparability property resembles an equivalence relation, so the function table gains a column eqclass. Each class is represented by a number and functions are statically assigned to these classes. Now the filtering happens in SQL instead of Python.
2013-08-01 support hashing gif images (Helmut Grohne)
* Rename "image_sha512" to "png_sha512".
* dedup.image.ImageHash is now a base class for image hashes such as PNGHash and GIFHash.
* Enable both hashes in importpkg.
* Fix README.
* Add new hash combinations to webapp.
* Add "gif file not named *.gif" to issues in update_sharing.
* Add redirect for "image_sha512" to webapp for backwards compatibility.
2013-07-30 templates/binary: space between package and compare (Helmut Grohne)