Age | Commit message | Author |
|
html_response expects a str-generator, but when we call the render
method, we receive a plain str. A plain str can still be iterated, but
only one character at a time, and that is what encode_and_buffer ends up
doing in this case. So always stream instead.
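The pitfall can be reproduced in a few lines. This is only a minimal
sketch: encode_chunks and as_stream are hypothetical stand-ins made up
for illustration, not the project's encode_and_buffer or html_response.

```python
def encode_chunks(chunks):
    """Encode every item of an iterable of str as UTF-8 (hypothetical stand-in)."""
    for chunk in chunks:
        yield chunk.encode("utf8")


def as_stream(rendered):
    """Wrap a plain str in a one-item generator so it is handled as one chunk."""
    yield rendered


page = "<p>hi</p>"
# Iterating a plain str yields single characters, so each one is encoded alone.
per_character = list(encode_chunks(page))
# Wrapping it in a generator keeps the whole document in a single chunk.
single_chunk = list(encode_chunks(as_stream(page)))
```

With the plain str the consumer sees nine one-byte chunks; with the
generator it sees a single chunk holding the whole page.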
|
|
When comparing two packages, objects were considered duplicates without
checking whether the respective hash functions are comparable, that is,
whether they belong to the same equivalence class. The current set of
hash functions does not expose this bug.
|
|
On the main instance, opening a cursor amounts to initiating a
connection. Unfortunately sqlite3.Connection.close does not close file
descriptors. So just open fewer cursors to leak file descriptors less
often.
|
|
Leaking them can result in running out of available file descriptors.
|
|
webapp used to have a relation hash_functions that modeled "comparable
functions". Images should not be compared to other files, since it makes
no sense to store them as the RGBA stream that is being hashed. This
comparability property resembles an equivalence relation. So the
function table gains a column eqclass. Each class is represented by a
number and functions are statically assigned to these classes. Now the
filtering happens in SQL instead of Python.
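A minimal sketch of what filtering by equivalence class in SQL could
look like. Only the eqclass column is taken from the message above; the
other table and column names, the class numbers and the sample rows
(including the gzip_sha512 name) are assumptions made up for this
illustration and need not match the real schema.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- Hypothetical, simplified schema; only eqclass comes from the text.
    CREATE TABLE function (id INTEGER PRIMARY KEY, name TEXT, eqclass INTEGER);
    CREATE TABLE hash (cid INTEGER, function_id INTEGER, hash TEXT);
""")
db.executemany("INSERT INTO function VALUES (?, ?, ?)", [
    (1, "sha512", 1),
    (2, "gzip_sha512", 1),   # assumed to be comparable with sha512
    (3, "png_sha512", 2),    # image hashes live in their own class
])
db.executemany("INSERT INTO hash VALUES (?, ?, ?)", [
    (10, 1, "aaaa"),   # file 10, hashed with sha512
    (20, 2, "aaaa"),   # file 20, same value, same class
    (30, 3, "aaaa"),   # file 30, same value, but image class
])
# Two hash rows only count as a match when their functions share a class.
matches = db.execute("""
    SELECT h1.cid, h2.cid
    FROM hash AS h1
    JOIN hash AS h2 ON h1.hash = h2.hash AND h1.cid < h2.cid
    JOIN function AS f1 ON f1.id = h1.function_id
    JOIN function AS f2 ON f2.id = h2.function_id
    WHERE f1.eqclass = f2.eqclass
    ORDER BY h1.cid, h2.cid
""").fetchall()
```

Here file 30 carries the same hash value as files 10 and 20, but it is
not reported because its function belongs to a different class.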
|
|
* Rename "image_sha512" to "png_sha512".
* dedup.image.ImageHash is now a base class for image hashes such as
PNGHash and GIFHash.
* Enable both hashes in importpkg.
* Fix README.
* Add new hash combinations to webapp.
* Add "gif file not named *.gif" to issues in update_sharing.
* Add redirect for "image_sha512" to webapp for backwards
compatibility.
|
|
They cluttered webapp.py and now vim can give proper highlighting for
the templates.
|
|
Actual savings on the full data set are around 7%.
Conflicts:
README
|
|
Currently these are invalid .gz files and PNG files not named *.png.
|
|
This already worked quite well for package.id. On a test data set of 5%
of the full size, this transformation reduces the database size by
about 4%.
|
|
One approach to improve performance is to reduce the database size. A
package name takes up 15 bytes on average. A package number takes up
two bytes. Multiply that difference by the number of references and it
should be noticeable. A small test set shows a reduction by 10%.
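The estimate above is plain arithmetic and can be written down as a
small helper. The 15 and 2 byte figures are taken from the message; the
number of references is not stated there, so the one million used below
is purely an assumed value for illustration.

```python
def estimated_savings(references, name_bytes=15, id_bytes=2):
    """Bytes saved by storing a package number instead of its name."""
    return (name_bytes - id_bytes) * references


# Assumed reference count, not a measured figure.
saved = estimated_savings(1_000_000)
```

This ignores per-row and index overhead, so it only gives a rough order
of magnitude.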
|
|
The git binary changed and so did its hash. Choosing a more stable
example now: The GPL-3.
|
|
Except it doesn't work, so replace it with our version. At least we
might be able to drop this code in a future update.
|
|
Only add rowspan when it carries a meaning.
|
|
Suggested by Paul Wise.
|
|
The original version had two major drawbacks:
1) The SQL query used would cause a btree sort, so the time waiting
for the first output was rather long.
2) For packages with many equal files, the output would grow with
O(n^2).
Thanks to the suggestions by Christine Grohne and Klaus Aehlig. The
approach now groups files in package1 by their main hash value (sha512).
It also now does manually some work that SQL was designed to do. To speed
up page generation a new caching table was added identifying which files
have corresponding shared files.
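The grouping idea can be sketched in plain Python. This is an
assumption-laden illustration, not the code in webapp: the function
names and the sample data are made up, and both packages are grouped
here for simplicity. Listing each group once keeps the output per hash
value at len(group1) + len(group2) entries instead of one row per pair.

```python
from collections import defaultdict


def group_by_hash(files):
    """Group (filename, digest) pairs by digest."""
    groups = defaultdict(list)
    for filename, digest in files:
        groups[digest].append(filename)
    return dict(groups)


def shared_groups(files1, files2):
    """Map each digest present in both packages to its two filename lists."""
    groups1 = group_by_hash(files1)
    groups2 = group_by_hash(files2)
    return {
        digest: (names1, groups2[digest])
        for digest, names1 in groups1.items()
        if digest in groups2
    }


# Made-up sample data; "h1" etc. stand in for sha512 values.
package1 = [("a/COPYING", "h1"), ("b/COPYING", "h1"), ("README", "h2")]
package2 = [("c/COPYING", "h1"), ("NEWS", "h3")]
shared = shared_groups(package1, package2)
```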
|
|
Fails on long inputs.
|
|
In the old content table, (package, filename, size) would be the same
for multiple hash functions. Now the schema represents that each file
has precisely one size, but multiple hashes.
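A minimal sketch of such a split, assuming a content table with one row
per file and a separate hash table referencing it. The column names, the
key layout and the sample values are guesses for illustration only and
may differ from the real schema.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- One row per file: the size is stored exactly once.
    CREATE TABLE content (
        id INTEGER PRIMARY KEY,
        package TEXT NOT NULL,
        filename TEXT NOT NULL,
        size INTEGER NOT NULL
    );
    -- Any number of hashes per file, one row per hash function.
    CREATE TABLE hash (
        cid INTEGER NOT NULL REFERENCES content(id),
        function TEXT NOT NULL,
        hash TEXT NOT NULL
    );
""")
cur = db.execute(
    "INSERT INTO content (package, filename, size) VALUES (?, ?, ?)",
    ("example", "usr/share/doc/example/COPYING", 1234),
)
cid = cur.lastrowid
db.executemany("INSERT INTO hash VALUES (?, ?, ?)", [
    (cid, "sha512", "aaaa"),
    (cid, "gzip_sha512", "bbbb"),
])
rows = db.execute("""
    SELECT c.filename, c.size, COUNT(h.hash)
    FROM content AS c JOIN hash AS h ON h.cid = c.id
    GROUP BY c.id
""").fetchall()
```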
|
|
The sharing table works great and I don't want to adapt it for the next
step in the schema change.
|
|
Apparently not all browsers understand <a ... /> in all rendering modes.
|
|
Thanks to Jan Luehr for doing the work.
|
|
Makes things more correct when using Application in a multiprocessing
context.
|