Age | Commit message | Author |
|
Resolve accumulated conflicts. In particular webapp.py gained a few
non-trivial ones, such as changes in InternalRedirect or usage of
contextlib.closing.
Conflicts:
schema.sql
webapp.py
|
|
wording, more NOT NULLs, some more explanations
|
|
Thanks to Peter Palfrader for explaining what information is needed and
reviewing the documentation.
|
|
|
|
|
|
While the importer can easily cope with this change, the web
presentation still needs fixing. It works somewhat now.
|
|
The webapp had a relation hash_functions that modeled "comparable
functions". Images should not be compared to other files, since it makes
no sense to store them as the RGBA stream that is being hashed. This
comparability property resembles an equivalence relation, so the
function table gains a column eqclass. Each class is represented by a
number and functions are statically assigned to these classes. Now the
filtering happens in SQL instead of Python.
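A minimal sketch of what filtering by eqclass in SQL could look like.
The column layout, the class numbers and the name gzip_sha512 are
assumptions made for illustration; only the function table, its eqclass
column and the image/sha512 distinction come from the message above.

```python
import sqlite3


def comparable_pairs(conn):
    """Return all (name1, name2) pairs of functions sharing an eqclass."""
    query = """
        SELECT f1.name, f2.name
        FROM function AS f1
        JOIN function AS f2 ON f1.eqclass = f2.eqclass
        ORDER BY f1.name, f2.name
    """
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
# Hypothetical layout: one row per hash function, statically assigned
# to an equivalence class.
conn.execute(
    "CREATE TABLE function ("
    "id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL, "
    "eqclass INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO function (name, eqclass) VALUES (?, ?)",
    [("sha512", 1), ("gzip_sha512", 1), ("image_sha512", 2)],
)
pairs = comparable_pairs(conn)
```

With this assignment the image hash is only ever paired with itself,
while the two functions in class 1 are paired with each other.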
|
|
* Rename "image_sha512" to "png_sha512".
* dedup.image.ImageHash is now a base class for image hashes such as
PNGHash and GIFHash.
* Enable both hashes in importpkg.
* Fix README.
* Add new hash combinations to webapp.
* Add "gif file not named *.gif" to issues in update_sharing.
* Add redirect for "image_sha512" to webapp for backwards
compatibility.
|
|
Actual savings on the full data set are around 7%.
Conflicts:
README
|
|
Currently these are invalid .gz files and PNG files not named *.png.
|
|
This already worked quite well for package.id. On a test data set of 5%
size this transformation reduces the database size by about 4%.
|
|
We can avoid a b-tree sort in the package comparison of the web app if
the package index also provides a size.
|
|
One approach to improve performance is to reduce the database size. A
package name takes up 15 bytes on average. A package number takes up
two bytes. Multiply that difference by the number of references and
it should be noticeable. A small test set shows a reduction of 10%.
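The estimate above as a small calculation. The byte sizes are the ones
quoted in the message; the reference count is a made-up number, not a
measurement, and the helper name is hypothetical.

```python
def estimated_savings(references, name_bytes=15, id_bytes=2):
    """Bytes saved when each reference stores a number instead of a name."""
    return references * (name_bytes - id_bytes)


# Hypothetical reference count, chosen only to show the scale.
savings = estimated_savings(1_000_000)
```

This ignores index and row overhead, so it is only a rough lower-level
estimate of the effect on the database file.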
|
|
sharing_package_index is a sub-index of sharing_insert_index and
therefore unnecessary.
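A sketch of why such a sub-index is redundant: a lookup on the leading
column can be served by the wider index. The column names and the
definition of sharing_insert_index are assumptions; only the index
names appear in the message above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical columns; only the wider index is created.
    CREATE TABLE sharing (
        package1 TEXT, package2 TEXT, files INTEGER, size INTEGER);
    CREATE INDEX sharing_insert_index ON sharing (package1, package2);
""")

# Ask SQLite how it would answer a lookup on the leading column alone.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT files FROM sharing WHERE package1 = ?",
    ("foo",),
).fetchall()
# The last column of each plan row holds the human-readable detail.
plan_text = " ".join(row[-1] for row in plan)
```

If plan_text mentions sharing_insert_index, the lookup is already an
index search and a separate index on package1 alone adds nothing.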
|
|
The original version had two major drawbacks:
1) The SQL query used would cause a btree sort, so the time waiting
for the first output was rather long.
2) For packages with many equal files, the output would grow with
O(n^2).
Thanks to Christine Grohne and Klaus Aehlig for their suggestions. The
approach now groups files in package1 by their main hash value (sha512).
It now also does manually some work that SQL was designed to do. To
speed up page generation, a new caching table was added that identifies
which files have corresponding shared files.
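A minimal sketch of the grouping idea, not the actual webapp code. The
function names and the (filename, hash) input format are assumptions.
Emitting one entry per hash value lists n + m filenames instead of
n * m pairs.

```python
from collections import defaultdict


def group_by_hash(files):
    """Map each hash value to the sorted filenames carrying it."""
    groups = defaultdict(list)
    for filename, digest in files:
        groups[digest].append(filename)
    return {digest: sorted(names) for digest, names in groups.items()}


def shared_groups(files1, files2):
    """Return one (names in package1, names in package2) per shared hash."""
    groups1 = group_by_hash(files1)
    groups2 = group_by_hash(files2)
    common = sorted(groups1.keys() & groups2.keys())
    return [(groups1[digest], groups2[digest]) for digest in common]


# Hypothetical input: (filename, sha512) pairs with shortened hashes.
files1 = [("a", "h1"), ("b", "h1"), ("c", "h2")]
files2 = [("x", "h1"), ("y", "h3")]
result = shared_groups(files1, files2)
```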
|
|
In the old content table, (package, filename, size) would be the same
for multiple hash functions. Now the schema represents that each file
has precisely one size, but multiple hashes.
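A sketch of such a split, assuming a content table with one row per
file and a hash table with one row per (file, function). The column
names, hash values and the function name gzip_sha512 are illustrative,
not taken from schema.sql.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE content (
        id INTEGER PRIMARY KEY,
        package TEXT NOT NULL,
        filename TEXT NOT NULL,
        size INTEGER NOT NULL);
    CREATE TABLE hash (
        cid INTEGER NOT NULL REFERENCES content(id),
        function TEXT NOT NULL,
        hash TEXT NOT NULL);
""")

# The size is stored once per file ...
cur = conn.execute(
    "INSERT INTO content (package, filename, size) VALUES (?, ?, ?)",
    ("foo", "./usr/bin/foo", 1234),
)
cid = cur.lastrowid
# ... while any number of hashes can refer to that file.
conn.executemany(
    "INSERT INTO hash (cid, function, hash) VALUES (?, ?, ?)",
    [(cid, "sha512", "aaaa"), (cid, "gzip_sha512", "bbbb")],
)
rows = conn.execute(
    "SELECT c.size, COUNT(*) FROM content AS c "
    "JOIN hash AS h ON h.cid = c.id GROUP BY c.id"
).fetchall()
```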
|
|
|
|
In the dependency table we will insert dependencies on packages which
are not tracked. This happens during initial import and for virtual
packages. Therefore the "required" column cannot be a foreign key.
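A sketch of the constraint choice, with hypothetical table layout and
package names. It assumes that the depending package is a foreign key
while "required" is plain text, so a dependency on an untracked or
virtual package can still be stored.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is enabled.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE package (package TEXT PRIMARY KEY);
    CREATE TABLE dependency (
        package TEXT NOT NULL REFERENCES package(package),
        required TEXT NOT NULL  -- deliberately not a foreign key
    );
""")
conn.execute("INSERT INTO package (package) VALUES ('foo')")
conn.commit()


def dependency_accepted(package, required):
    """Try to insert a dependency; report whether the schema allows it."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO dependency (package, required) VALUES (?, ?)",
                (package, required),
            )
    except sqlite3.IntegrityError:
        return False
    return True


# 'virtual-bar' is not a tracked package, yet the row is accepted.
virtual_ok = dependency_accepted("foo", "virtual-bar")
# An untracked depending package violates the foreign key.
untracked_ok = dependency_accepted("untracked", "foo")
```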
|
|
|
|
|
|
|
|
The sharing table is a cache for the /binary web pages. It essentially
contains the numbers presented. This caching table is not automatically
populated. It needs to be reconstructed after every (group of) package
imports.
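A sketch of why the cache needs an explicit rebuild. The simplified
content and sharing layouts and the rebuild_sharing helper are
assumptions for illustration, not the real update_sharing code.

```python
import sqlite3


def rebuild_sharing(conn):
    """Recompute the cache of shared file counts and sizes from scratch."""
    with conn:
        conn.execute("DELETE FROM sharing")
        conn.execute("""
            INSERT INTO sharing (package1, package2, files, size)
            SELECT a.package, b.package, COUNT(*), SUM(a.size)
            FROM content AS a
            JOIN content AS b ON a.hash = b.hash AND a.rowid != b.rowid
            GROUP BY a.package, b.package
        """)


def read_sharing(conn):
    return conn.execute(
        "SELECT package1, package2, files, size FROM sharing "
        "ORDER BY package1, package2"
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE content (
        package TEXT, filename TEXT, size INTEGER, hash TEXT);
    CREATE TABLE sharing (
        package1 TEXT, package2 TEXT, files INTEGER, size INTEGER);
""")
conn.executemany(
    "INSERT INTO content VALUES (?, ?, ?, ?)",
    [("foo", "a", 10, "h1"), ("bar", "b", 10, "h1"), ("bar", "c", 5, "h2")],
)
rebuild_sharing(conn)
first = read_sharing(conn)

# A later import changes content, but the cache stays stale ...
conn.execute("INSERT INTO content VALUES ('foo', 'd', 5, 'h2')")
stale = read_sharing(conn)
# ... until it is reconstructed.
rebuild_sharing(conn)
fresh = read_sharing(conn)
```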
|
|
|