Age | Commit message | Author |
|
|
|
In the meantime, the master branch evolved quite a bit and the schema
changed again (eqclass added to function table). The main reason for the
merge is to resolve the large number of conflicts once, so development
of the sqlalchemy branch can continue and still benefit from changes in
the master branch, such as schema compatibility and the adapted indent
level in the web app due to the use of contextlib.closing, which resembles
sqlalchemy's "with db.begin() as conn:".
Conflicts:
autoimport.py
dedup/utils.py
readyaml.py
update_sharing.py
webapp.py
|
|
|
|
|
|
|
|
Reported-By: Stefan Kaltenbrunner
|
|
When comparing two packages, objects would be considered duplicates
without considering whether the respective hash functions are comparable
by checking their equivalence classes. The current set of hash functions
does not expose this bug.
|
|
Hash functions are partitioned into equivalence classes. We are
generally only interested in sharing among hash functions with the same
equivalence class, but the algorithm would compute any sharing. While
the current layout never produces the same hashes for functions in
different equivalence classes (because their output lengths differ), that
may change in the future.
Also allow hash functions that belong to no equivalence class at all
(eqclass = NULL) as a means to add additional metadata to content
without computing any sharing for it.
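A minimal sketch of how such a filter could look in SQL, using sqlite3.
Only the function table and its eqclass column come from the messages
above; the hash table, its columns and the sample rows are hypothetical.
Because "NULL = NULL" is not true in SQL, functions with eqclass = NULL
drop out of the sharing computation without any special casing.

```python
import sqlite3

# Hypothetical miniature schema: only "function" and "eqclass" are taken
# from the commit messages, everything else is made up for illustration.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE function (id INTEGER PRIMARY KEY, name TEXT, eqclass INTEGER);
    CREATE TABLE hash (content TEXT, function_id INTEGER, value TEXT);
    INSERT INTO function VALUES (1, 'sha512', 1);
    INSERT INTO function VALUES (2, 'gzip_sha512', 1);
    INSERT INTO function VALUES (3, 'png_sha512', 2);
    INSERT INTO function VALUES (4, 'metadata', NULL);
    INSERT INTO hash VALUES ('a', 1, 'x');
    INSERT INTO hash VALUES ('b', 2, 'x');
    INSERT INTO hash VALUES ('c', 3, 'x');
    INSERT INTO hash VALUES ('d', 4, 'x');
    INSERT INTO hash VALUES ('e', 4, 'x');
""")

# Report two contents as shared only if the hash values match AND both
# hash functions belong to the same (non-NULL) equivalence class.
pairs = db.execute("""
    SELECT h1.content, h2.content
    FROM hash h1
    JOIN function f1 ON f1.id = h1.function_id
    JOIN hash h2 ON h2.value = h1.value AND h1.content < h2.content
    JOIN function f2 ON f2.id = h2.function_id
    WHERE f1.eqclass = f2.eqclass
    ORDER BY h1.content, h2.content
""").fetchall()
db.close()
```

All five sample rows carry the same hash value, yet only the pair from
class 1 is reported; the two eqclass = NULL rows do not even match each
other.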
|
|
Otherwise the gzip hash cannot tell the empty stream and the
compressed empty stream apart.
|
|
|
|
Otherwise all files smaller than 10 bytes are successfully hashed to the
hash of the empty input when using the GzipDecompressor.
Reported-By: Olly Betts
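The failure mode can be illustrated with the standard zlib module rather
than the project's GzipDecompressor, whose implementation is not shown
here. A streaming decompressor that has seen less than a full gzip header
produces no output and raises no error, so hashing its output yields the
hash of the empty input. The helper name gzip_payload_hash is
hypothetical; checking the eof attribute is one possible way to tell a
truncated stream from a complete one.

```python
import gzip
import hashlib
import zlib


def gzip_payload_hash(data):
    """Return the SHA-512 hex digest of the decompressed payload, or
    None if the gzip stream did not reach its end marker."""
    # 16 + MAX_WBITS tells zlib to expect gzip framing.
    decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
    payload = decomp.decompress(data)
    if not decomp.eof:
        # Truncated input: payload may be empty although nothing valid
        # was decoded, so refuse to hash it.
        return None
    return hashlib.sha512(payload).hexdigest()


empty_hash = hashlib.sha512(b"").hexdigest()
compressed_empty = gzip.compress(b"")
complete = gzip_payload_hash(compressed_empty)
truncated = gzip_payload_hash(compressed_empty[:5])
```

Here complete equals the hash of the empty input, which is legitimate,
while the 5-byte prefix is rejected instead of being hashed to the same
value. Input that does not start with the gzip magic raises zlib.error
instead.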
|
|
|
|
On the main instance, opening a cursor equals initiating a connection.
Unfortunately sqlite3.Connection.close does not close file descriptors.
So just open fewer cursors to leak file descriptors less often.
|
|
Leaking them can result in running out of available file descriptors.
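A minimal sketch of the general pattern, assuming plain sqlite3 and a
hypothetical table t: one cursor is reused for several statements and
both cursor and connection are closed deterministically with
contextlib.closing. It does not reproduce the leak described above, it
only shows how to keep the number of open cursors small.

```python
import contextlib
import sqlite3


def count_rows(path=":memory:"):
    """Run several statements through a single cursor and close
    everything on exit, even if an exception is raised."""
    with contextlib.closing(sqlite3.connect(path)) as conn:
        with contextlib.closing(conn.cursor()) as cur:
            # Hypothetical table, for illustration only.
            cur.execute("CREATE TABLE t (x INTEGER)")
            cur.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])
            cur.execute("SELECT count(*) FROM t")
            return cur.fetchone()[0]


row_count = count_rows()
```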
|
|
|
|
|
|
|
|
|
|
* In practice there are very few compressed files with trivial hashes.
* Blacklisting these values results in false positives in the gzip
issues.
|
|
|
|
|
|
No explicit "import sqlite3" left. It's still a bit rough around the
edges, particularly since sqlalchemy's support for executemany is
totally broken.
|
|
webapp had a relation hash_functions that modeled "comparable
functions". Images should not be compared to other files, since it makes
no sense to store them as the RGBA stream that is being hashed. This
comparability property resembles an equivalence relation. So the
function table gains a column eqclass. Each class is represented by a
number and functions are statically assigned to these classes. Now the
filtering happens in SQL instead of Python.
|
|
This makes the sqlalchemy branch schema-compatible with master again.
The biggest change on master was the introduction of the function table.
It caused most of the conflicts. Note that webapp had one conflict not
detected by git: The selecting of issues in show_package needed
sqlalchemy conversion.
Conflicts:
README
update_sharing.py
webapp.py
|
|
* Rename "image_sha512" to "png_sha512".
* dedup.image.ImageHash is now a base class for image hashes such as
PNGHash and GIFHash.
* Enable both hashes in importpkg.
* Fix README.
* Add new hash combinations to webapp.
* Add "gif file not named *.gif" to issues in update_sharing.
* Add redirect for "image_sha512" to webapp for backwards
compatibility.
|
|
|
|
|
|
|
|
|
|
|
|
They cluttered webapp.py and now vim can give proper highlighting for
the templates.
|
|
|
|
Actual savings on the full data set are around 7%.
Conflicts:
README
|
|
Currently these are invalid .gz files and png files not named *.png.
|
|
|
|
This voids the benefits of processing rows during row generation as has
been observed on postgres.
|
|
This should reduce the query bandwidth to the rdbms.
|
|
|
|
|
|
|
|
|
|
This already worked quite well for package.id. On a test data set of 5%
size this transformation reduces the database size by about 4%.
|
|
We can avoid a b-tree sort in the package comparison of the web app if
the package index also provides the size.
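The effect can be observed with EXPLAIN QUERY PLAN in sqlite3. The table
content, its columns and the query below are hypothetical stand-ins, not
the project's actual schema; the sketch only compares an index on the
package column alone with one that also covers the size.

```python
import sqlite3


def query_plan(index_columns):
    """Return SQLite's query plan for a size-ordered lookup of one
    package, given the column list of the only index on the table."""
    db = sqlite3.connect(":memory:")
    # Hypothetical table, for illustration only.
    db.execute(
        "CREATE TABLE content (package INTEGER, filename TEXT, size INTEGER)"
    )
    db.execute(
        "CREATE INDEX content_package_index ON content (%s)" % index_columns
    )
    rows = db.execute(
        "EXPLAIN QUERY PLAN "
        "SELECT filename FROM content WHERE package = ? ORDER BY size",
        (1,),
    ).fetchall()
    db.close()
    # The last column of each plan row holds the human-readable detail.
    return " | ".join(str(row[-1]) for row in rows)


plan_without_size = query_plan("package")
plan_with_size = query_plan("package, size")
```

With the index on package alone, the plan contains a temporary b-tree
step for the ORDER BY; with the index on (package, size) the rows already
come out of the index in the requested order.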
|
|
|
|
Without using this wrapper the sql statements are not munged by
sqlalchemy. Specifically paramstyle is not translated. For sqlite3 this
did not matter, because it allows the changed paramstyle, but for
postgres it fails without sqlalchemy.text wrappers.
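The sqlite3 half of this can be checked with the standard library alone.
The sketch below uses a hypothetical package table and shows that
sqlite3 declares qmark as its paramstyle while also accepting :name
placeholders, which is why untranslated statements happened to work
there. psycopg2 declares pyformat instead, so the same raw string would
not be understood without the translation step.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Hypothetical table, for illustration only.
db.execute("CREATE TABLE package (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("INSERT INTO package (name) VALUES (:name)", {"name": "example"})

# The declared DB-API paramstyle of sqlite3 is qmark ...
declared_style = sqlite3.paramstyle
by_qmark = db.execute(
    "SELECT id FROM package WHERE name = ?", ("example",)
).fetchone()

# ... but named placeholders are accepted as well.
by_name = db.execute(
    "SELECT id FROM package WHERE name = :name", {"name": "example"}
).fetchone()
db.close()
```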
|
|
This basically pulls the packageid branch into sqlalchemy. The merge was
complex, because many sql statements diverged. The merge brings us one
step closer to supporting postgres, because an "INSERT OR REPLACE" was
removed from readyaml.py in the packageid branch.
Conflicts:
update_sharing.py
webapp.py
|
|
|
|
|
|
|
|
By using the :name syntax inside sql statements, sqlalchemy will replace
the placeholders with whatever paramstyle the underlying dbapi2 module
needs. For instance, the paramstyle of psycopg2 is not qmark.
|