summaryrefslogtreecommitdiff
path: root/README
blob: db7fde480b0a97a6fad5b722cf052c730561f896 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
Required packages
-----------------

    aptitude install python3 python3-arpy python3-debian python3-jinja2 python3-werkzeug sqlite3 python3-pil python3-yaml python3-pkg-resources

Create a database
-----------------
The database name is currently hardcoded as `test.sqlite3`. So copy the SQL
statements from `schema.sql` into `sqlite3 test.sqlite3`. In addition it is
highly recommended to put the database into WAL mode. Otherwise all your
reading queries will block forever when doing an import. This setting is
permanent.

    PRAGMA journal_mode = WAL;

Import packages
---------------
Import individual packages by feeding them to importpkg.py and readyaml.py:

    ./importpkg.py < somepkg.deb | ./readyaml.py

You can import your local apt cache:

    ./autoimport.py /var/cache/apt/archives

Import a full mirror (only http supported):

    ./autoimport.py -n -p http://your.mirror.example/debian

After changing the database, a few tables caching expensive computations need
to be (re)generated. Execute `./update_sharing.py`. Without this step the web
interface will report wrong results.

Viewing the results
-------------------
Run `./webapp.py` and enjoy a webinterface at `0.0.0.0:8800` or inspect the
SQL database by hand. Here are some example queries.

Finding the 100 largest files shared with multiple packages.

    SELECT pa.name, a.filename, pb.name, b.filename, a.size FROM content AS a JOIN hash AS ha ON a.id = ha.cid JOIN hash AS hb ON ha.hash = hb.hash JOIN content AS b ON b.id = hb.cid JOIN package AS pa ON a.pid = pa.id JOIN package AS pb ON b.pid = pb.id WHERE (a.pid != b.pid OR a.filename != b.filename) ORDER BY a.size DESC LIMIT 100;

Finding those top 100 files that save most space when being reduced to only
one copy in the archive.

    SELECT hash, sum(size)-min(size), count(*), count(distinct pid) FROM content JOIN hash ON content.id = hash.cid JOIN function ON hash.fid = function.id WHERE function.name = "sha512" GROUP BY hash ORDER BY sum(size)-min(size) DESC LIMIT 100;

Finding PNG images that do not carry a .png file extension.

    SELECT package.name, content.filename, content.size FROM content JOIN hash ON content.id = hash.cid JOIN package ON content.pid = package.id JOIN function ON hash.fid = function.id WHERE function.name = "png_sha512" AND lower(filename) NOT LIKE "%.png";

Finding .gz files which either are not gziped or contain errors.

    SELECT package.name, content.filename FROM content JOIN package ON content.pid = package.id WHERE filename LIKE "%.gz" AND (SELECT count(*) FROM hash JOIN function ON hash.fid = function.id WHERE hash.cid = content.id AND function.name = "gzip_sha512") = 0;