Age  Commit message  Author
2023-05-09  add type annotations to most of the code  (HEAD, master)  Helmut Grohne
2022-01-12  webapp.py: fuse two sql queries in get_details  Helmut Grohne
2022-01-12  webapp.py: generalize url router  Helmut Grohne
Each view now has its own view function following the show_ pattern and accepts its parsed parameters as keyword-only arguments.
2021-12-31  importpkg.py + readyaml.py: prefer the C libyaml implementation  Helmut Grohne
2021-12-31  dedup.utils: uninline helper function iterate_packages  Helmut Grohne
2021-12-31  webapp.py: consistently close cursors using context managers  Helmut Grohne
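A minimal sketch of closing a cursor with a context manager. It assumes the standard sqlite3 module; the table name and the package_names helper are hypothetical and not taken from webapp.py.

```python
import contextlib
import sqlite3

# Hypothetical in-memory database standing in for the real one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE package (name TEXT)")
conn.execute("INSERT INTO package VALUES ('dedup')")


def package_names(conn: sqlite3.Connection) -> list:
    # closing() calls cur.close() when the block exits,
    # including when the query raises an exception.
    with contextlib.closing(conn.cursor()) as cur:
        cur.execute("SELECT name FROM package ORDER BY name")
        return [row[0] for row in cur.fetchall()]


names = package_names(conn)
```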
2021-12-30  DecompressedStream: improve performance  Helmut Grohne
When the decompression ratio is huge, we may be faced with a large (multiple megabytes) bytes object. Slicing that object incurs a copy, so repeated slicing becomes O(n^2), while appending to and trimming a bytearray is much faster.
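The difference can be sketched as follows. This is an illustration of the bytearray technique only, not the DecompressedStream code; drain_in_chunks and chunk_size are hypothetical names.

```python
def drain_in_chunks(data: bytes, chunk_size: int) -> list:
    """Hand out data in pieces of at most chunk_size bytes."""
    buf = bytearray()
    buf += data  # appending extends the buffer in place
    chunks = []
    while buf:
        # Copy out only the piece that is handed to the caller ...
        chunks.append(bytes(buf[:chunk_size]))
        # ... and trim it in place instead of rebuilding the remainder
        # as a new bytes object via data = data[chunk_size:].
        del buf[:chunk_size]
    return chunks
```

With a plain bytes object, each `data = data[chunk_size:]` copies everything that is left, which is where the quadratic cost comes from.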
2021-12-29  DecompressedStream: fix endless loop  Helmut Grohne
Fixes: 775bdde52ad5 ("DecompressedStream: avoid mixing types for variable data")
2021-12-29  webapp: avoid changing variable type  Helmut Grohne
Again static type checking is the driver for the change here.
2021-12-29  autoimport: avoid changing variable type  Helmut Grohne
knownpkgvers is a dict while knownpkgs is a set. Separating them helps static type checkers.
2021-12-29  webapp: speed up encode_and_buffer  Helmut Grohne
We now know that our parameter is a jinja2.environment.TemplateStream. Enable buffering and accumulate via an io.BytesIO to avoid O(n^2) append.
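A rough sketch of accumulating through io.BytesIO. The function name comes from the commit message, but the body and the threshold parameter are assumptions, and a plain iterable of str stands in for the jinja2 TemplateStream.

```python
import io
from typing import Iterable, Iterator


def encode_and_buffer(chunks: Iterable[str], threshold: int = 16) -> Iterator[bytes]:
    # threshold is a hypothetical tuning knob for this sketch.
    buf = io.BytesIO()
    for chunk in chunks:
        # Writing into BytesIO avoids building a new bytes object per
        # append, as repeated "data += piece" on bytes would.
        buf.write(chunk.encode("utf8"))
        if buf.tell() >= threshold:
            yield buf.getvalue()
            buf = io.BytesIO()
    if buf.tell() > 0:
        yield buf.getvalue()
```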
2021-12-29  webapp: improve performance  Helmut Grohne
html_response expects a str-generator, but when we call the render method, we receive a plain str. It can be iterated - one character at a time. That's what encode_and_buffer will do in this case. So better stream all the time.
2021-12-29  webapp: forward compatibility with newer werkzeug  Helmut Grohne
2021-12-29  autoimport.py: convert to use pathlib  Helmut Grohne
2021-12-29  importpkg: fix suppression of boring content  Helmut Grohne
The content must be bytes. Passing str silently skips the suppression.
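Why a str slips through silently can be illustrated like this. BORING_CONTENT and is_boring are hypothetical stand-ins, not the names used in importpkg.

```python
# Hypothetical set of contents that should be suppressed.
BORING_CONTENT = {b"", b"\n"}


def is_boring(content) -> bool:
    # In Python 3 a str never compares equal to a bytes object,
    # so a str argument is simply "not found" instead of raising.
    return content in BORING_CONTENT


bytes_result = is_boring(b"\n")  # suppression applies
str_result = is_boring("\n")     # silently skipped
```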
2021-12-29  DecompressedHash: also gain a name property for consistency  Helmut Grohne
2021-12-29  ImageHash: gain a name property  Helmut Grohne
Instead of retroactively attaching a name to an ImageHash, autogenerate it via a property. Doing so also simplifies static type checking.
2021-12-29  don't return the first parameter from hash_file  Helmut Grohne
Returning the object gets us into trouble as to what precisely the return type is at no benefit.
2021-12-29  drop unused function sql_add_version_compare  Helmut Grohne
2021-12-29  DecompressedStream: avoid mixing types for variable data  Helmut Grohne
The local variable data can be bool or bytes. That's inconvenient for static type checkers. Avoid doing so.
2021-12-29  DecompressedStream: eliminate redundant closed field  Helmut Grohne
2020-10-25  drop obsolete python modules  Helmut Grohne
Both lzma and concurrent.futures are now part of the standard library and solely exist as virtual packages.
2020-10-25  externalize ar parsing to arpy  Helmut Grohne
2020-10-25  use python3-pil instead of removed python3-imaging  Helmut Grohne
2020-02-16  drop support for Python 2.x  Helmut Grohne
2018-06-25  adapt to python3-magic/2:0.4.15-1 API  Helmut Grohne
2017-09-23  add module dedup.filemagic  Helmut Grohne
This module is not used anywhere and thus its dependency on python3-magic is not recorded in the README. It can be used to guess the file type by looking at the contents using file magic. It is not a typical hash function, but it can be used for repurposing dedup for other analysers.
2017-09-13  fix HashBlacklistContent.copy  Helmut Grohne
It wasn't copying the stored member and thus could blacklist "wrong" content after a copy.
2016-11-13  autoimport: fix regression in url computation  Helmut Grohne
The list path got inadvertently prepended to all binary package urls.
Fixes: 420804c25797 ("autoimport: improve fetching package lists")
2016-07-29  repository moved  Helmut Grohne
2016-06-09  DecompressedStream: fix decompression without flush  Helmut Grohne
In Python 3.x, lzma.LZMADecompressor doesn't have a flush method.
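One way to cope with decompressors that lack flush is to call it only when present. The sketch below is an assumption about the general approach, not the actual fix; decompress_all is a hypothetical helper.

```python
import lzma
import zlib


def decompress_all(decompressor, chunks) -> bytes:
    out = bytearray()
    for chunk in chunks:
        out += decompressor.decompress(chunk)
    # zlib decompression objects offer flush();
    # lzma.LZMADecompressor does not, so only call it if it exists.
    flush = getattr(decompressor, "flush", None)
    if flush is not None:
        out += flush()
    return bytes(out)


payload = b"dedup" * 100
from_xz = decompress_all(lzma.LZMADecompressor(), [lzma.compress(payload)])
from_zlib = decompress_all(zlib.decompressobj(), [zlib.compress(payload)])
```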
2016-06-09  autoimport: fix hash check  Helmut Grohne
Fixes: 2f12a6e2f426 ("autoimport: add option to skip hash checking")
2016-05-25  autoimport: improve fetching package lists  Helmut Grohne
Moving the fetching part into dedup.utils. Instead of hard coding the gzip compressed copy, try xz, gz and plain in that order. Also take care to actually close the connection.
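The fallback order could look roughly like this. open_first_available, its opener parameter and the example URL are hypothetical; the real helper in dedup.utils may be structured differently. A caller would still need to close the returned object, for instance with contextlib.closing.

```python
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")


def open_first_available(
    base_url: str,
    opener: Callable[[str], T],
    extensions: Sequence[str] = (".xz", ".gz", ""),
) -> Tuple[str, T]:
    """Try each compressed variant in order and return the first that opens."""
    last_error = None
    for ext in extensions:
        try:
            return ext, opener(base_url + ext)
        except OSError as err:  # urllib's HTTPError derives from OSError
            last_error = err
    raise OSError("no variant of %s could be fetched" % base_url) from last_error


def fake_opener(url: str) -> str:
    # Stand-in for urlopen: pretend only the gzip copy exists.
    if url.endswith(".gz"):
        return "gzip stream"
    raise OSError("404: " + url)


chosen = open_first_available("http://example.invalid/Packages", fake_opener)
```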
2016-05-24  use urlopen from urllib2 on py2  Helmut Grohne
This causes non-successful fetches to result in HTTPErrors like it does in py3 already.
2016-05-23  move dedup.debpkg.process_control back into importpkg  Helmut Grohne
After all, it isn't that generic. It knows what information is necessary for running dedup. Thus it really belongs to the extractor subclass. By building on handle_control_info, not that much parsing logic is left in the extractor subclass.
2016-05-23  DebExtractor: implement parsing of control.tar  Helmut Grohne
2016-05-23  importpkg: fix --hash broken in previous commit  Helmut Grohne
2016-05-23  remove curl dependency  Helmut Grohne
Teach importpkg how to download urls using urlopen and thus remove the need for invoking curl.
2016-05-23  autoimport: add option to skip hash checking  Helmut Grohne
For variations of dedup that do not consume the data.tar member, this option can save significant bandwidth.
2016-05-22  autoimport: stream package list and use generic decompressor  Helmut Grohne
 * Streaming means that we do not need to hold the entire package list in memory (but the pkgs dict will become large anyway).
 * The decompress utility allows easily switching to e.g. xz which is the only compression format for the dbgsym suites.
2016-05-22  DecompressedStream: implement readline  Helmut Grohne
Iteration over file-like is required by deb822.Packages.iter_paragraphs.
2016-05-21  move from deprecated optparse to argparse  Helmut Grohne
2016-05-05  treat Pre-Depends like regular Depends  Helmut Grohne
The former behaviour was ignoring them. The intended use for dedup is to know whenever a package unconditionally requires another package.
2016-05-01  push more functionality into DebExtractor  Helmut Grohne
The handle_ar_member and handle_ar_end methods now have a default implementation adding further handlers handle_debversion, handle_control_tar and handle_data_tar. In that process two additional bugs were fixed:
 * decompress_tar was wrongly passing errors="surrogateescape" for Python 2.x even though that's only supported for Python 3.x.
 * The use of decompress actually passes the extension as unicode.
2016-05-01  use same Python version for autoimport and importpkg  Helmut Grohne
The autoimport tool runs the Python interpreter explicitly. Instead of invoking just "python" and thus calling whatever the current default is, use sys.executable which is the interpreter used to run autoimport, thus locking both to the same Python version.
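A small sketch of the idea, assuming subprocess is used to launch the child. run_child is a hypothetical helper, and the inline -c snippet stands in for invoking importpkg.

```python
import subprocess
import sys


def run_child(code: str) -> str:
    # sys.executable is the interpreter running this script, so the
    # child uses the same Python version instead of whatever "python"
    # happens to resolve to on PATH.
    result = subprocess.run(
        [sys.executable, "-c", code],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


child_major = run_child("import sys; print(sys.version_info[0])")
```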
2016-04-28  support Python 3.x in importpkg  Helmut Grohne
In Python 2.x, TarInfo.name is a bytes object. In Python 3.x, TarInfo.name always is a unicode object. To avoid importpkg crashing with an exception, we direct the Python 3.x decoding to use surrogateescapes. Thus decoding the name boils down to checking whether it contains surrogates.
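The surrogate check can be illustrated in isolation. decode_name and has_surrogates are hypothetical helpers, and the raw byte strings stand in for member names read from a tar archive.

```python
def decode_name(raw: bytes) -> str:
    # Undecodable bytes are mapped to lone surrogates instead of raising.
    return raw.decode("utf8", "surrogateescape")


def has_surrogates(name: str) -> bool:
    # A strict UTF-8 encode fails exactly when lone surrogates are present.
    try:
        name.encode("utf8")
    except UnicodeEncodeError:
        return True
    return False


good = decode_name("caf\u00e9".encode("utf8"))
bad = decode_name(b"caf\xe9")  # latin-1 bytes, invalid as UTF-8
# The original bytes remain recoverable through the same error handler.
roundtrip = bad.encode("utf8", "surrogateescape")
```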
2016-04-28  decouple a function decompress out of decompress_tar  Helmut Grohne
Building on the previous commit, add a decompress function that turns a compressed filelike into a decompressed filelike. Use it to decouple the decompression step.
2016-04-28  extend functionality of DecompressedStream  Helmut Grohne
It now supports:
 * tell()
 * seek(absolute_position), forward only
 * close()
 * closed
This is sufficient for putting it as a fileobj into tarfile.TarFile. By doing so we can decouple decompression from tar processing, which eases papering over the Python 2.x vs Python 3.x differences.
2016-04-21  importpkg: move the hash function list to the extractor class  Helmut Grohne
They really are an aspect of the particular extractor and can easily be changed by subclassing.
2016-04-19  add a class DebExtractor for guiding feature extraction  Helmut Grohne
It is supposed to separate the parsing of Debian packages (understanding how the format works) from the actual feature extraction. Its goal is to simplify writing custom extractors for different feature sets.