Required packages
-----------------

    aptitude install python python-debian python-lzma curl python-jinja2 python-werkzeug sqlite3 python-imaging

Create a database
-----------------
The database name is currently hardcoded as `test.sqlite3`, so feed the SQL
statements from `schema.sql` to `sqlite3 test.sqlite3`. In addition, it is
highly recommended to put the database into WAL mode. Otherwise, all read
queries will block for as long as an import is running. The setting is
persistent: it is stored in the database file and only needs to be set once.

    PRAGMA journal_mode = WAL;

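The two steps above can also be scripted. The following is a minimal sketch using Python's standard `sqlite3` module; the helper name `create_database` is hypothetical and not part of this project, and the schema text is passed in by the caller rather than assumed.

```python
import sqlite3


def create_database(db_path, schema_sql):
    """Create the tables from schema_sql and switch the database to WAL mode.

    Hypothetical helper for illustration; returns the journal mode that
    SQLite reports after the switch.
    """
    conn = sqlite3.connect(db_path)
    try:
        # Run all statements of the schema in one go.
        conn.executescript(schema_sql)
        # The pragma answers with a single row naming the resulting mode.
        mode = conn.execute("PRAGMA journal_mode = WAL;").fetchone()[0]
        conn.commit()
    finally:
        conn.close()
    return mode
```

Something like `create_database("test.sqlite3", open("schema.sql").read())` should then be equivalent to the manual steps. The returned value is `wal` when the switch succeeded. WAL mode needs a database file on disk, so it does not apply to in-memory databases.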
Import packages
---------------
Import individual packages by feeding them to `importpkg.py` on standard input:

    ls -t /var/cache/apt/archives/*.deb | while read -r f; do echo "$f"; ./importpkg.py < "$f" || break; done

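The same loop can be written in Python if a shell one-liner is inconvenient. This is only a sketch: the function names `newest_first` and `import_all` are made up for illustration, and the only thing assumed about `importpkg.py` is what the command above shows, namely that it reads one package from standard input and exits non-zero on failure.

```python
import glob
import os
import subprocess


def newest_first(pattern):
    """Return the files matching pattern, most recently modified first (like ls -t)."""
    return sorted(glob.glob(pattern), key=os.path.getmtime, reverse=True)


def import_all(paths, command=("./importpkg.py",)):
    """Feed each file to the importer on stdin and stop at the first failure.

    Returns the list of paths that were imported successfully.
    """
    done = []
    for path in paths:
        print(path)
        with open(path, "rb") as deb:
            result = subprocess.run(list(command), stdin=deb)
        if result.returncode != 0:
            break  # mirrors the "|| break" of the shell loop
        done.append(path)
    return done
```

Calling `import_all(newest_first("/var/cache/apt/archives/*.deb"))` from the project directory should behave like the shell loop.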
Import a full mirror:

    ./autoimport.py http://your.mirror.example/debian

Viewing the results
-------------------
Run `./webapp.py` and browse the web interface, which listens on
`0.0.0.0:8800`, or inspect the SQL database by hand. Here are some example
queries.

Finding the 100 largest files shared between multiple packages:

    SELECT a.package, a.filename, b.package, b.filename, a.size
    FROM content AS a
    JOIN hash AS ha ON a.id = ha.cid
    JOIN hash AS hb ON ha.hash = hb.hash
    JOIN content AS b ON b.id = hb.cid
    WHERE (a.package != b.package OR a.filename != b.filename)
    ORDER BY a.size DESC LIMIT 100;

Finding the 100 files that would save the most space if they were reduced to
a single copy in the archive:

    SELECT hash, sum(size)-min(size), count(*), count(distinct package)
    FROM content JOIN hash ON content.id = hash.cid
    WHERE hash.function = 'sha512'
    GROUP BY hash
    ORDER BY sum(size)-min(size) DESC LIMIT 100;

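To see what the columns of this query mean, it can be run on toy data. The sketch below uses an in-memory database with a hypothetical minimal schema that is inferred from the example queries only; the real `schema.sql` may define more columns and constraints. The query is the one above, with the `hash` column written as `hash.hash` for clarity.

```python
import sqlite3

# Hypothetical minimal schema, inferred from the example queries only.
SCHEMA = """
CREATE TABLE content (id INTEGER PRIMARY KEY, package TEXT, filename TEXT, size INTEGER);
CREATE TABLE hash (cid INTEGER, function TEXT, hash TEXT);
"""

SAVINGS_QUERY = """
SELECT hash.hash, sum(size)-min(size), count(*), count(distinct package)
FROM content JOIN hash ON content.id = hash.cid
WHERE hash.function = 'sha512'
GROUP BY hash.hash
ORDER BY sum(size)-min(size) DESC LIMIT 100;
"""


def demo_savings():
    """Run the savings query on made-up rows and return the result."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO content VALUES (?, ?, ?, ?)", [
        (1, "pkg-a", "/usr/share/a/logo", 1000),
        (2, "pkg-b", "/usr/share/b/logo", 1000),
        (3, "pkg-b", "/usr/share/b/logo-copy", 1000),
        (4, "pkg-c", "/usr/share/c/readme", 50),
    ])
    # Made-up hash values: the first three files share their content.
    conn.executemany("INSERT INTO hash VALUES (?, ?, ?)", [
        (1, "sha512", "aaa"),
        (2, "sha512", "aaa"),
        (3, "sha512", "aaa"),
        (4, "sha512", "bbb"),
    ])
    rows = conn.execute(SAVINGS_QUERY).fetchall()
    conn.close()
    return rows


print(demo_savings())
# [('aaa', 2000, 3, 2), ('bbb', 0, 1, 1)]
```

Three copies of a 1000 byte file occupy 3000 bytes, so keeping one copy saves 2000 bytes. The last two columns are the number of copies and the number of distinct packages carrying them.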
Finding PNG images that do not carry a .png file extension:

    SELECT package, filename, size
    FROM content JOIN hash ON content.id = hash.cid
    WHERE function = 'image_sha512' AND filename NOT LIKE '%.png';

Finding .gz files which either are not gzipped or contain errors:

    SELECT content.package, content.filename
    FROM content
    WHERE filename LIKE '%.gz'
      AND (SELECT count(*) FROM hash
           WHERE hash.cid = content.id AND hash.function = 'gzip_sha512') = 0;

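A file reported by this query can be checked by hand after extracting it from its package. The following sketch uses Python's standard `gzip` module and is independent of how the importer itself decides whether to record a `gzip_sha512` hash; the function name `gzip_problem` is hypothetical.

```python
import gzip
import zlib


def gzip_problem(path):
    """Return None if path decompresses cleanly, else a short error description."""
    try:
        with gzip.open(path, "rb") as stream:
            # Read to the end so that truncation and CRC errors surface.
            while stream.read(1 << 16):
                pass
    except (OSError, EOFError, zlib.error) as exc:
        return "%s: %s" % (type(exc).__name__, exc)
    return None
```

A return value of `None` means the file is a complete, valid gzip stream. Anything else names the error raised during decompression, for example for a plain text file with a `.gz` name or a truncated archive.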