~helmut/debian-dedup.git

Age	Commit message	Author
2014-02-25	explain what can be done with the new dataconflicts	Helmut Grohne

2014-02-25	record package metadata that describes co-installability	Helmut Grohne
	Specifically all entries in the Conflicts header are saved in the conflict table, all entries in the Provides header are saved in the provide table (to cover conflicts with virtual packages) and packages using dpkg-divert in preinst get a magic "_dpkg-divert" entry in their conflict table. With this metadata it should be possible to compute undeclared file conflicts.
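A minimal sketch of how this metadata could be used to find undeclared file conflicts. The in-memory dictionaries and the function name `undeclared_conflicts` are hypothetical stand-ins for the conflict, provide and content tables. Version constraints are ignored, and treating a "_dpkg-divert" entry as exempting a package from the check is an assumption.

```python
from itertools import combinations

DIVERT = "_dpkg-divert"  # magic conflict entry named in the commit message


def undeclared_conflicts(files, conflicts, provides):
    """Return sorted (pkg_a, pkg_b, path) triples for paths shipped by both
    packages without any declared conflict between them.

    files:     package -> set of shipped paths (hypothetical layout)
    conflicts: package -> set of names from the Conflicts header
    provides:  package -> set of virtual names from the Provides header
    """
    def names(pkg):
        # A conflict may target the real name or any provided virtual name.
        return {pkg} | provides.get(pkg, set())

    def declared(a, b):
        ca = conflicts.get(a, set())
        cb = conflicts.get(b, set())
        # Assumption: a package that diverts files is treated as handled.
        if DIVERT in ca or DIVERT in cb:
            return True
        return bool(ca & names(b)) or bool(cb & names(a))

    found = []
    for a, b in combinations(sorted(files), 2):
        if declared(a, b):
            continue
        for path in sorted(files[a] & files[b]):
            found.append((a, b, path))
    return found
```

The pairwise loop is quadratic in the number of packages, so a real data set would more likely be handled with a join over the content table. The sketch only shows which pairs the recorded metadata lets us rule out.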
2014-02-23	Merge branch updatesharing-eqclass	Helmut Grohne

2014-02-23	spell check comments	Helmut Grohne

2014-02-23	fix spelling mistake	Helmut Grohne
	Reported-By: Stefan Kaltenbrunner
2014-02-23	webapp: fix eqclass usage in package comparison	Helmut Grohne
	When comparing two packages, objects would be considered duplicates without considering whether the respective hash functions are comparable by checking their equivalence classes. The current set of hash functions does not expose this bug.
2014-02-21	update_sharing: weaken assumptions about db layout	Helmut Grohne
	Hash functions are partitioned into equivalence classes. We are generally only interested in sharing among hash functions with the same equivalence class, but the algorithm would compute any sharing. While the current layout never produces the same hashes for functions in different equivalence classes (because their output lengths differ), that may change in the future. Also allow hash functions that belong to no equivalence class at all (eqclass = NULL) as a means to add additional metadata to content without computing any sharing for it.
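A minimal sketch of restricting sharing to hash functions of the same equivalence class, using an in-memory SQLite database. The two-table layout, the column names other than eqclass, the class assignments and the "note" function are illustrative assumptions, not the project's actual schema.

```python
import sqlite3


def make_demo_db():
    """Build a tiny database where every stored hash value is identical."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE function (id INTEGER PRIMARY KEY, name TEXT, eqclass INTEGER);
        CREATE TABLE hash (content TEXT,
                           functionid INTEGER REFERENCES function(id),
                           hash TEXT);
    """)
    conn.executemany("INSERT INTO function VALUES (?, ?, ?)", [
        (1, "sha512", 1),
        (2, "gzip_sha512", 1),   # assumed comparable with sha512
        (3, "png_sha512", 2),    # images form their own class
        (4, "note", None),       # hypothetical metadata-only function
    ])
    conn.executemany("INSERT INTO hash VALUES (?, ?, ?)", [
        ("a", 1, "h1"),
        ("b", 2, "h1"),
        ("c", 3, "h1"),
        ("d", 4, "h1"),
        ("e", 4, "h1"),
    ])
    return conn


def comparable_matches(conn):
    """Return pairs of equal hashes whose functions share an eqclass.

    In SQL, NULL = NULL is not true, so functions with eqclass NULL
    never match anything, not even themselves.
    """
    return conn.execute("""
        SELECT h1.content, f1.name, h2.content, f2.name
        FROM hash AS h1
        JOIN hash AS h2 ON h2.hash = h1.hash AND h1.rowid < h2.rowid
        JOIN function AS f1 ON f1.id = h1.functionid
        JOIN function AS f2 ON f2.id = h2.functionid
        WHERE f1.eqclass = f2.eqclass
        ORDER BY h1.rowid, h2.rowid
    """).fetchall()
```

Although all five rows carry the same hash value, only the pair from class 1 is reported, which is the behaviour the commit message asks for independently of output lengths.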
2014-02-19	blacklist content rather than hashes	Helmut Grohne
	Otherwise the gzip hash cannot tell the empty stream and the compressed empty stream apart.
2014-02-19	GzipDecompressor: don't treat checksum as garbage trailer	Helmut Grohne

2014-02-19	DecompressedHash should fail on trailing input	Helmut Grohne
	Otherwise all files smaller than 10 bytes are successfully hashed to the hash of the empty input when using the GzipDecompressor.
	Reported-By: Olly Betts
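A minimal sketch of the failure mode and a strict check, using the standard zlib module rather than the project's own GzipDecompressor. The function name `gunzip_strict` is hypothetical, and rejecting every byte after the first gzip member (including further concatenated members) is a simplifying assumption.

```python
import gzip
import zlib


def gunzip_strict(data):
    """Decompress a single gzip member.

    Raise ValueError if the stream is malformed, truncated, or followed
    by any trailing bytes.
    """
    # wbits = 16 + MAX_WBITS selects the gzip container format.
    decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
    try:
        out = decomp.decompress(data)
    except zlib.error as exc:
        raise ValueError("invalid gzip data: %s" % exc) from exc
    if not decomp.eof:
        # Inputs shorter than the gzip header end up here: zlib returns
        # no output and raises no error, it merely waits for more data.
        raise ValueError("truncated gzip stream")
    if decomp.unused_data:
        raise ValueError("trailing input after gzip stream")
    return out


if __name__ == "__main__":
    blob = gzip.compress(b"hello")
    print(gunzip_strict(blob))
    for bad in (b"", blob[:5], blob + b"x"):
        try:
            gunzip_strict(bad)
        except ValueError as err:
            print("rejected:", err)
```

Without the `eof` check, the empty and the five-byte inputs would both yield an empty result and therefore the hash of the empty input.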
2013-10-03	work around python-debian's #670679	Helmut Grohne

2013-09-11	webapp: open cursors less often	Helmut Grohne
	On the main instance, opening cursors equals initiating a connection. Unfortunately sqlite3.Connection.close does not close file descriptors. So just open fewer cursors to leak file descriptors less often.
2013-09-10	webapp: close database cursors	Helmut Grohne
	Leaking them can result in running out of available file descriptors.
2013-09-04	webapp: serve static files from /static	Helmut Grohne

2013-09-02	add option -d --database for db path to all scripts	Helmut Grohne

2013-09-02	autoimport: avoid hard coded temporary directory	Helmut Grohne

2013-09-02	importpkg: move library-like parts to dedup.debpkg	Helmut Grohne

2013-08-19	importpkg: don't blacklist boring gzip_sha512 hashes	Helmut Grohne
	* In practice there are very few compressed files with trivial hashes.
	* Blacklisting these values results in false positives in the gzip issues.
2013-08-16	make debian version_compare available in sql	Helmut Grohne

2013-08-16	webapp templates: add an anchor for file issues	Helmut Grohne

2013-08-02	model comparability as an equivalence relation	Helmut Grohne
	The webapp has had a relation hash_functions that modeled "comparable functions". Images should not be compared to other files, since it makes no sense to store them as the RGBA stream that is being hashed. This comparability property resembles an equivalence relation. So the function table gains a column eqclass. Each class is represented by a number and functions are statically assigned to these classes. Now the filtering happens in SQL instead of Python.
2013-08-01	support hashing gif images	Helmut Grohne
	* Rename "image_sha512" to "png_sha512".
	* dedup.image.ImageHash is now a base class for image hashes such as PNGHash and GIFHash.
	* Enable both hashes in importpkg.
	* Fix README.
	* Add new hash combinations to webapp.
	* Add "gif file not named .gif" to issues in update_sharing.
	Add redirect for "image_sha512" to webapp for backwards compatibility.
2013-07-30	templates/binary: space between package and compare	Helmut Grohne

2013-07-30	templates: wiki.d.o redirects to https now	Helmut Grohne

2013-07-30	fix update_sharing to work after functionid merge	Helmut Grohne

2013-07-29	importpkg.py: support uncompressed data.tar	Helmut Grohne

2013-07-27	also move the static directory into the dedup package	Helmut Grohne

2013-07-27	move templates to dedup package	Helmut Grohne
	They cluttered webapp.py and now vim can give proper highlighting for the templates.
2013-07-26	verify package hashes when importing via http	Helmut Grohne

2013-07-26	Merge branch functionid	Helmut Grohne
	Actual savings on the full data set are around 7%.
	Conflicts: README
2013-07-25	display "issues" with files in package view	Helmut Grohne
	Currently this is invalid .gz files and png files not named .png.
2013-07-25	README: foo.PNG is also a valid png name	Helmut Grohne

2013-07-24	readyaml: cache the whole function table	Helmut Grohne
	This should reduce the query bandwidth to the rdbms.
2013-07-23	webapp: make html for index valid	Helmut Grohne

2013-07-23	README: fix typo in query	Helmut Grohne

2013-07-23	webapp: remove unused function	Helmut Grohne

2013-07-23	adapt queries in README to new schema	Helmut Grohne

2013-07-23	schema: reference hash functions by integer key	Helmut Grohne
	This already worked quite well for package.id. On a test data set of 5% size this transformation reduces the database size by about 4%.
2013-07-22	schema: extend content_package_index	Helmut Grohne
	We can avoid a b-tree sort in the package comparison of the web app if the package index also provides a size.
2013-07-15	Merge branch 'packageid'	Helmut Grohne

2013-07-12	importpkg: simplify state logic	Helmut Grohne

2013-07-12	importpkg: split process_package to process_control	Helmut Grohne

2013-07-10	schema: reference package table by integer key	Helmut Grohne
	One approach to improve performance is to reduce the database size. A package name takes up 15 bytes on average. A package number takes up two bytes. Multiply that difference by the number of references and it should be noticeable. A small test set shows a reduction of 10%.
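A minimal sketch of referencing packages by integer key instead of by name, using an in-memory SQLite database. Only package.id is attested elsewhere in this log; the content table, its pid and filename columns, and the helper names are illustrative assumptions.

```python
import sqlite3


def make_db():
    """Create a package table plus a table that references it by id."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE package (id INTEGER PRIMARY KEY,
                              name TEXT UNIQUE NOT NULL);
        CREATE TABLE content (pid INTEGER NOT NULL REFERENCES package(id),
                              filename TEXT NOT NULL);
    """)
    return conn


def add_file(conn, package, filename):
    """Record a file and return the integer key of its package."""
    # The package name is stored once; every reference costs a small integer.
    conn.execute("INSERT OR IGNORE INTO package (name) VALUES (?)", (package,))
    (pid,) = conn.execute(
        "SELECT id FROM package WHERE name = ?", (package,)).fetchone()
    conn.execute(
        "INSERT INTO content (pid, filename) VALUES (?, ?)", (pid, filename))
    return pid


def files_of(conn, package):
    """Look files up by package name via a join on the integer key."""
    rows = conn.execute("""
        SELECT c.filename FROM content AS c
        JOIN package AS p ON p.id = c.pid
        WHERE p.name = ? ORDER BY c.filename
    """, (package,))
    return [row[0] for row in rows]
```

Queries that start from a package name pay for one extra join, which is the trade-off against the smaller rows in the referencing tables.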
2013-07-10	schema.sql: drop unused index	Helmut Grohne
	sharing_package_index is a sub-index of sharing_insert_index and therefore unnecessary.
2013-07-03	README: explain update_sharing.py	Helmut Grohne

2013-06-23	Merge branch yamlimport	Helmut Grohne
	+ Way faster on multiple cores.
	+ More reliable, because http connections do not time out when the db blocks.
	- Way slower on single core with contended io path. No clue why.
	Still, update_sharing.py makes up the bulk of processing time.
2013-06-19	webapp: fix hash example link after git upload	Helmut Grohne
	The git binary changed and so did its hash. Choosing a more stable example now: The GPL-3.
2013-06-11	autoimport: don't fork for readyaml	Helmut Grohne
	This appears to be a huge performance boost.
2013-06-11	autoimport: support processing individual files	Helmut Grohne
	This gets back the original functionality of importpkg.py.
2013-06-10	split the import phase to a yaml stream	Helmut Grohne
	importpkg.py now emits a yaml stream instead of updating the database. The actual updating now happens in readyaml.py. In this process autoimport.py was significantly reworked to import packages in parallel.