Age  Commit message  Author
2014-03-08  enable result buffering for postgres  Helmut Grohne
2014-03-08  restrict sqlite-specific configuration to sqlite databases  Helmut Grohne
2014-03-08  autoimport: fix --database option broken in merge  Helmut Grohne
2014-03-08  Merge branch 'master' into sqlalchemy  Helmut Grohne
In the meantime, the master branch evolved quite a bit and the schema changed again (eqclass added to function table). The main reason for the merge is to resolve the large number of conflicts once, so development of the sqlalchemy branch can continue and still benefit from changes in the master branch, such as schema compatibility and the adapted indent level in the web app due to the use of contextlib.closing, which resembles sqlalchemy's "with db.begin() as conn:".
Conflicts: autoimport.py dedup/utils.py readyaml.py update_sharing.py webapp.py
2014-03-08  schema: make syntax compatible with postgres  Helmut Grohne
2014-02-23  Merge branch updatesharing-eqclass  Helmut Grohne
2014-02-23  spell check comments  Helmut Grohne
2014-02-23  fix spelling mistake  Helmut Grohne
Reported-By: Stefan Kaltenbrunner
2014-02-23  webapp: fix eqclass usage in package comparison  Helmut Grohne
When comparing two packages, objects would be considered duplicates without checking, via their equivalence classes, whether the respective hash functions are comparable. The current set of hash functions does not expose this bug.
2014-02-21  update_sharing: weaken assumptions about db layout  Helmut Grohne
Hash functions are partitioned into equivalence classes. We are generally only interested in sharing among hash functions with the same equivalence class, but the algorithm would compute any sharing. While the current layout never produces the same hashes for functions in different equivalence classes (because their output lengths differ), that may change in the future. Also allow hash functions that belong to no equivalence class at all (eqclass = NULL) as a means to add additional metadata to content without computing any sharing for it.
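The restriction described above can be sketched with the standard sqlite3 module. Only the function table and its eqclass column come from the log; the hash table, its columns, and the sample rows are assumptions made up for illustration and are not the project's real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "function" with "eqclass" is named in the log; "hash" is a hypothetical table.
conn.execute("CREATE TABLE function (id INTEGER PRIMARY KEY, name TEXT, eqclass INTEGER)")
conn.execute("CREATE TABLE hash (content TEXT, function_id INTEGER, hash TEXT)")
conn.executemany(
    "INSERT INTO function VALUES (?, ?, ?)",
    [(1, "sha512", 1), (2, "gzip_sha512", 1), (3, "png_sha512", 2), (4, "metadata", None)],
)
# Every row deliberately carries the same hash value to show the filtering.
conn.executemany(
    "INSERT INTO hash VALUES (?, ?, ?)",
    [("a", 1, "h1"), ("b", 2, "h1"), ("c", 3, "h1"), ("d", 4, "h1"), ("e", 4, "h1")],
)

# Equal hashes only count as sharing when both functions are in the same class.
# "f1.eqclass = f2.eqclass" is never true when either side is NULL, so functions
# without a class are excluded without any extra condition.
sharing = conn.execute(
    """
    SELECT h1.content, h2.content
    FROM hash AS h1
    JOIN function AS f1 ON f1.id = h1.function_id
    JOIN hash AS h2 ON h2.hash = h1.hash AND h1.content < h2.content
    JOIN function AS f2 ON f2.id = h2.function_id
    WHERE f1.eqclass = f2.eqclass
    ORDER BY h1.content, h2.content
    """
).fetchall()
```

In this toy data only the pair ("a", "b") survives: "c" belongs to a different class, and "d" and "e" belong to none.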
2014-02-19  blacklist content rather than hashes  Helmut Grohne
Otherwise the gzip hash cannot tell the empty stream and the compressed empty stream apart.
2014-02-19  GzipDecompressor: don't treat checksum as garbage trailer  Helmut Grohne
2014-02-19  DecompressedHash should fail on trailing input  Helmut Grohne
Otherwise all files smaller than 10 bytes are successfully hashed to the hash of the empty input when using the GzipDecompressor.
Reported-By: Olly Betts
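A minimal sketch of the failure mode and one way to guard against it. This does not reproduce the project's own GzipDecompressor or DecompressedHash; it substitutes zlib's built-in gzip mode, and the function name gzip_content_hash is hypothetical.

```python
import hashlib
import zlib


def gzip_content_hash(data):
    """Return the SHA-512 of the decompressed payload, rejecting bad input."""
    # wbits = 16 + MAX_WBITS makes zlib expect a gzip header and trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    hasher = hashlib.sha512()
    try:
        hasher.update(decompressor.decompress(data))
        hasher.update(decompressor.flush())
    except zlib.error as err:
        raise ValueError("invalid gzip data: %s" % err)
    # Without this check, input that ends before the stream is complete
    # would silently hash to the digest of the empty string.
    if not decompressor.eof:
        raise ValueError("truncated gzip stream")
    # Bytes after the end of the stream are treated as an error here.
    if decompressor.unused_data:
        raise ValueError("trailing data after gzip stream")
    return hasher.hexdigest()
```

Note that this sketch also rejects files made of several concatenated gzip members, which may or may not be the desired policy.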
2013-10-03  work around python-debian's #670679  Helmut Grohne
2013-09-11  webapp: open cursors less often  Helmut Grohne
On the main instance, opening a cursor equals initiating a connection. Unfortunately, sqlite3.Connection.close does not close file descriptors. So just open fewer cursors to leak file descriptors less often.
2013-09-10  webapp: close database cursors  Helmut Grohne
Leaking them can result in running out of available file descriptors.
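One common way to make sure a cursor is always closed is contextlib.closing, which a later merge message in this log also mentions for the web app. The sketch below uses a hypothetical package table and helper name; it is not the web app's actual code.

```python
import contextlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE package (name TEXT)")  # hypothetical table
conn.execute("INSERT INTO package VALUES ('dedup')")


def package_names(conn):
    # closing() calls cur.close() when the block exits, even on an exception.
    with contextlib.closing(conn.cursor()) as cur:
        cur.execute("SELECT name FROM package ORDER BY name")
        return [row[0] for row in cur.fetchall()]
```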
2013-09-04  webapp: serve static files from /static  Helmut Grohne
2013-09-02  add option -d --database for db path to all scripts  Helmut Grohne
2013-09-02  autoimport: avoid hard coded temporary directory  Helmut Grohne
2013-09-02  importpkg: move library-like parts to dedup.debpkg  Helmut Grohne
2013-08-19  importpkg: don't blacklist boring gzip_sha512 hashes  Helmut Grohne
* In practice there are very few compressed files with trivial hashes.
* Blacklisting these values results in false positives in the gzip issues.
2013-08-16  make debian version_compare available in sql  Helmut Grohne
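The log does not say how the comparison was exposed to SQL. With the standard sqlite3 module, one possible approach is Connection.create_function. The comparator below is a deliberately simplified placeholder that only handles dot-separated integers; it is not Debian's version comparison algorithm.

```python
import sqlite3


def simple_version_compare(a, b):
    """Placeholder: compares dot-separated integers only, NOT Debian's rules."""
    pa = [int(part) for part in a.split(".")]
    pb = [int(part) for part in b.split(".")]
    return (pa > pb) - (pa < pb)  # -1, 0 or 1


conn = sqlite3.connect(":memory:")
# Register the Python function under the SQL name version_compare with 2 arguments.
conn.create_function("version_compare", 2, simple_version_compare)
result = conn.execute("SELECT version_compare('1.10', '1.9')").fetchone()[0]
```

A real implementation would register an actual Debian version comparator in place of the placeholder.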
2013-08-16  webapp templates: add an anchor for file issues  Helmut Grohne
2013-08-03  convert remaining code to sqlalchemy  Helmut Grohne
No explicit "import sqlite3" left. It's still a bit rough around the edges, particularly since sqlalchemy's support for executemany is totally broken.
2013-08-02  model comparability as an equivalence relation  Helmut Grohne
webapp has had a relation hash_functions that modeled "comparable functions". Images should not be compared to other files, since it makes no sense to store them as the RGBA stream that is being hashed. This comparability property resembles an equivalence relation, so the function table gains a column eqclass. Each class is represented by a number and functions are statically assigned to these classes. Now the filtering happens in SQL instead of Python.
2013-08-02  Merge branch master into sqlalchemy  Helmut Grohne
This makes the sqlalchemy branch schema-compatible with master again. The biggest change on master was the introduction of the function table. It caused most of the conflicts. Note that webapp had one conflict not detected by git: the selection of issues in show_package needed sqlalchemy conversion.
Conflicts: README update_sharing.py webapp.py
2013-08-01  support hashing gif images  Helmut Grohne
* Rename "image_sha512" to "png_sha512".
* dedup.image.ImageHash is now a base class for image hashes such as PNGHash and GIFHash.
* Enable both hashes in importpkg.
* Fix README.
* Add new hash combinations to webapp.
* Add "gif file not named *.gif" to issues in update_sharing.
* Add redirect for "image_sha512" to webapp for backwards compatibility.
2013-07-30  templates/binary: space between package and compare  Helmut Grohne
2013-07-30  templates: wiki.d.o redirects to https now  Helmut Grohne
2013-07-30  fix update_sharing to work after functionid merge  Helmut Grohne
2013-07-29  importpkg.py: support uncompressed data.tar  Helmut Grohne
2013-07-27  also move the static directory into the dedup package  Helmut Grohne
2013-07-27  move templates to dedup package  Helmut Grohne
They cluttered webapp.py and now vim can give proper highlighting for the templates.
2013-07-26  verify package hashes when importing via http  Helmut Grohne
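The general idea of verifying a download against a known digest can be sketched as follows. The log does not state which hash the importer checks or how; SHA-256, the function name read_verified, and the chunk size are assumptions for illustration.

```python
import hashlib
import io


def read_verified(stream, expected_sha256, chunk_size=65536):
    """Read a stream fully and raise ValueError unless its SHA-256 matches."""
    hasher = hashlib.sha256()
    chunks = []
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        hasher.update(chunk)  # hash incrementally while reading
        chunks.append(chunk)
    if hasher.hexdigest() != expected_sha256:
        raise ValueError("hash mismatch for downloaded package")
    return b"".join(chunks)


# Usage with an in-memory stream standing in for an HTTP response body.
payload = b"example package data"
data = read_verified(io.BytesIO(payload), hashlib.sha256(payload).hexdigest())
```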
2013-07-26  Merge branch functionid  Helmut Grohne
Actual savings on the full data set are around 7%.
Conflicts: README
2013-07-25  display "issues" with files in package view  Helmut Grohne
Currently these are invalid .gz files and png files not named .png.
2013-07-25  README: foo.PNG is also a valid png name  Helmut Grohne
2013-07-24  sqlalchemy's fetchmany defaults to being fetchall  Helmut Grohne
This voids the benefits of processing rows during row generation, as has been observed on postgres.
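Passing an explicit batch size to fetchmany is one way to keep row processing incremental. The generator below is a sketch against a plain DB-API cursor from the standard sqlite3 module, not sqlalchemy's result object; the name fetchiter and the default size are assumptions.

```python
import sqlite3


def fetchiter(cursor, size=1000):
    """Yield rows one by one while fetching them in batches of at most `size`."""
    while True:
        rows = cursor.fetchmany(size)  # explicit size, never the driver default
        if not rows:
            break
        for row in rows:
            yield row


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (n INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(5)])
values = [row[0] for row in fetchiter(conn.execute("SELECT n FROM t ORDER BY n"), size=2)]
```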
2013-07-24  readyaml: cache the whole function table  Helmut Grohne
This should reduce the query bandwidth to the rdbms.
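Caching a small lookup table can be as simple as loading it into a dictionary once and resolving names in memory afterwards. The column names and sample rows below are assumptions; only the existence of a function table comes from the log.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical minimal layout of the function table.
conn.execute("CREATE TABLE function (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")
conn.executemany("INSERT INTO function (name) VALUES (?)", [("sha512",), ("gzip_sha512",)])

# One query up front; later lookups are dictionary accesses, not round trips.
function_ids = dict(conn.execute("SELECT name, id FROM function"))
```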
2013-07-23  webapp: make html for index valid  Helmut Grohne
2013-07-23  README: fix typo in query  Helmut Grohne
2013-07-23  webapp: remove unused function  Helmut Grohne
2013-07-23  adapt queries in README to new schema  Helmut Grohne
2013-07-23  schema: reference hash functions by integer key  Helmut Grohne
This already worked quite well for package.id. On a test data set of 5% size, this transformation reduces the database size by about 4%.
2013-07-22  schema: extend content_package_index  Helmut Grohne
We can avoid a b-tree sort in the package comparison of the web app if the package index also provides a size.
2013-07-20  another missing sqlalchemy.text wrapper  Helmut Grohne
2013-07-20  use sqlalchemy.text  Helmut Grohne
Without this wrapper, the sql statements are not munged by sqlalchemy. Specifically, paramstyle is not translated. For sqlite3 this did not matter, because it allows the changed paramstyle, but for postgres it fails without sqlalchemy.text wrappers.
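The sqlite3 side of this can be illustrated with the standard library alone: the module advertises the qmark paramstyle but also accepts named placeholders, which is plausibly why unwrapped statements kept working there. The table and values are made up for illustration, and which style the project's statements actually used is not stated in the log.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE package (name TEXT)")  # hypothetical table

# Advertised DB-API style for sqlite3: positional question marks.
conn.execute("INSERT INTO package VALUES (?)", ("dedup",))

# The named style (:name) is accepted as well, so either form runs unchanged.
named = conn.execute(
    "SELECT name FROM package WHERE name = :name", {"name": "dedup"}
).fetchall()
```

A driver that accepts only one placeholder style would reject the other form unless something translates it first, which is the role the commit message assigns to sqlalchemy.text.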
2013-07-17  Merge branch master into sqlalchemy  Helmut Grohne
This basically pulls the packageid branch into sqlalchemy. The merge was complex, because many sql statements diverged. The merge brings us one step closer to supporting postgres, because an "INSERT OR REPLACE" was removed from readyaml.py in the packageid branch.
Conflicts: update_sharing.py webapp.py
2013-07-15  Merge branch 'packageid'  Helmut Grohne
2013-07-12  importpkg: simplify state logic  Helmut Grohne