summaryrefslogtreecommitdiff
blob: bfeafc999ffd71009c0cbb1e0d0d7f61fc475979 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
---
GLEP: 75
Title: Split distfile mirror directory structure
Author: Michał Górny <mgorny@gentoo.org>,
        Robin H. Johnson <robbat2@gentoo.org>
Type: Standards Track
Status: Final
Version: 1
Created: 2018-01-26
Last-Modified: 2019-11-11
Post-History: 2018-01-27, 2019-10-24
Content-Type: text/x-rst
---

Abstract
========
This GLEP describes the procedure for splitting the distfiles on mirrors
into multiple directories with the goal of reducing the number of files
in a single directory.


Motivation
==========
At the moment, both the package manager and Gentoo mirrors use flat
directory structure to store files.  While this solution usually works,
it does not scale well.  Directories with large number of files usually
have significant performance penalty, unless using filesystems
specifically designed for that purpose.

According to the Gentoo repository state at 2018-01-26 16:23, there
was a total of 62652 unique distfiles in the repository.  While
the users realistically hit around 10% of that, distfile mirrors often
hold even more files — more so if old distfiles are not wiped
immediately.

While all filesystems used on Linux boxes should be able to cope with
a number that large, they may suffer a performance penalty with even
a few thousand files.  Additionally, if mirrors enable directory indexes
then generating the index imposes both a significant server overhead
and a significant data transfer.  At this moment, the index
of distfiles.gentoo.org has around 17 MiB.

Splitting the distfiles into multiple directories makes it possible
to avoid those problems by reducing the number of files in a single
directory.  For example, splitting the forementioned set of distfiles
into 16 directories that are roughly balanced allows to reduce
the number of files in a single directory to around 4000.  Splitting
them further into 256 directories (16x16) results in 200-300 files
per directory which should avoid any performance problems long-term,
even assuming 300% growth of number of distfiles.


Specification
=============
Mirror layout file
------------------
A mirror adhering to this specification should include a ``layout.conf``
file in the top distfile directory.  This file uses the format
derived from the freedesktop Desktop Entry Specification file format
[#DESKTOP_FORMAT]_.

Before using each Gentoo mirror, the package manager should attempt
to fetch (update) its ``layout.conf`` file and process it to determine
how to use the mirror.  If the file is not present, the package manager
should behave as if it were empty.

The package manager should recognize the sections and keys listed below.
It should ignore any unrecognized sections or keys — the format
is intended to account for future extensions.

This specification currently defines one section: ``[structure]``.
This section defines one or more repository structure definitions
using non-negative sequential integer keys.  The definition with
the ``0`` key is the most preferred structure.  The package manager
should ignore any formats it does not recognize.  If this section
is not present, the package manager should behave as if only ``flat``
structure were specified.

The following structure definitions are supported:

* ``flat`` to indicate the traditional flat structure where all
  distfiles are located in the top directory,

* ``filename-hash <algorithm> <cutoffs>`` to indicate the `filename
  hash structure`_ explained below.


Filename hash structure
-----------------------
When using the filename hash structure, the distfiles are split
into directories whose names are derived from the hash of distfile
filename.  This structure has two parameters: *algorithm name*
and *cutoffs* list.

The algorithm name must correspond to a valid Manifest hash name.
An informational list of hashes is included in GLEP 74 [#GLEP74]_,
and the policies for introducing new hashes are covered by GLEP 59
[#GLEP59]_.

The cutoffs list specifies one or more integers separated by colons
(``:``), indicating the number of bits (starting with the most
significant bit) of the hash used to form subsequent subdirectory names.
For example, the list of ``4:8`` would indicate that top-level directory
names are formed using 4 most significant bits of the hash (resulting
in 2⁴ = 16 directories), and each of these directories would have
subdirectories formed using the next 8 bits of the hash (resulting
in 2⁸ = 256 subdirectories each).

Implementations are only required to support cutoffs being multiples
of 4.  Support for other values is optional.

The exact algorithm for determining the distfile location follows:

1. Let the distfile filename be **F**.

2. Compute the hash of **F** and store its binary value as **H**.

3. For each integer **C** in cutoff list:

   a. Remove the **C** most significant bits from **H** and store them
      as **V**.

   b. Convert **V** into hexadecimal notation (digits ``0`` to ``9``
      and lowercase letters ``a`` to ``f``), left padded with zeros
      to **C/4** digits (rounded up), and append it to the path,
      followed by the path separator.

4. Finally, append **F** to the obtained path.

In particular, note that when using nested directories
the subdirectories do not repeat the hash bits used in parent directory.


Migrating mirrors to the hashed structure
-----------------------------------------
Since all distfile mirrors sync to the master Gentoo mirror, it should
be enough to perform all the needed changes on the master mirror
and wait for other mirrors to sync.  The following procedure
is recommended:

1. Include the initial ``layout.conf`` listing only ``flat`` layout.

2. Create the new structure alongside the flat structure. Wait for
   mirrors to sync.

3. Once all mirrors receive the new structure, update ``layout.conf``
   to list the ``filename-hash`` structure.

4. Once a version of Portage supporting the new structure is stable long
   enough, remove the fallback ``flat`` structure from ``layout.conf``
   and duplicate distfiles.

This implies that during the migration period the distfiles will
be stored duplicated on the mirrors and therefore will occupy twice
as much space.  Technically, this could be avoided either by using
hard links or symbolic links.

The hard link solution allows us to save space on the master mirror.
Additionally, if ``-H`` option is used by the mirrors it avoids
transferring existing files again.  However, this option is known
to be expensive and could cause significant server load.  Without it,
all mirrors need to transfer a second copy of all the existing files.

The symbolic link solution could be more reliable if we could rely
on mirrors using the ``--links`` rsync option.  Without that, symbolic
links are not transferred at all.


Using hashed structure for local distfiles
------------------------------------------
The hashed structure defined above could also be used for local distfile
storage as used by the package manager.  For this to work, the package
manager authors need to ensure that:

a. The ``${DISTDIR}`` variable in the ebuild scope points to a temporary
   directory where distfiles specific to the package are linked
   in a flat structure.

b. All tools are updated to support the nested structure.

c. The package manager provides a tool for users to easily manipulate
   distfiles, in particular to add distfiles for fetch-restricted
   packages into an appropriate subdirectory.

For extended compatibility, the package manager may support finding
distfiles in flat and nested structure simultaneously.


Rationale
=========
Algorithm for splitting distfiles
---------------------------------
The possible algorithms were considered with the following goals
in mind:

- the number of files in a single directory should not exceed 1000,

- the total size of files in a single directory is not considered
  relevant,

- the solution should preferably be future-proof,

- moving distfiles should be avoided once it is deployed.

It should also be noted that at this moment the package having most
distfiles in Gentoo at the time is dev-texlive/texlive-latexextra,
with the number of 8556 distfiles.  All of them start with a common
prefix of ``texlive-module-``.  This specific prefix is used by a total
of 23435 distfiles.

In the original debate that occurred in bug #534528 [#BUG534528]_
and the mailing list review of the initial version of this GLEP [#ML1]_,
four fundamental ideas for splitting distfiles were listed:

a. using initial portion of filename,

b. using initial portion of file hash,

c. using initial portion of filename hash,

d. using package category (and package name).

The initial filename idea was to use the first character of filename,
possibly followed by a longer part which was the idea historically
used e.g. by PyPI Python package hosting.  Its main advantage is
simplicity.  The users can easily determine the correct subdirectory
by just looking at the distfile name.  Sadly, this solution is not only
very uneven but does not solve the problem.  As mentioned above,
the TeΧ Live packages share a long common prefix that make it impossible
to split it properly with other packages on fixed-length prefixes.

This idea has been followed by an adaptive proposal by Andrew Barchuk
[#ADAPTIVE_FILENAME]_.  In this proposal, the filenames are not strictly
mapped to groups by a common prefix but instead each group contains
all files between two prefixes being used (like in a dictionary).
However, it has been pointed out that while this option can provide
very even results initially, it is impossible to predict how it would
be affected by future distfile changes and there will be a risk of
needing to change the groups in the future.  Furthermore, it is
relatively complex and requires explicitly listing or obtaining used
groups.

Another option was to use an initial portion of distfile hashes.  Its
main advantage is that cryptographic hash algorithms can provide
a more balanced split with random data.  Furthermore, since hashes are
stored in Manifests using them has no cost for users.  However, this
solution has three disadvantages:

1. Not all files in the distfile tree are covered by package Manifests.
   Additional files are injected into the mirrors, and those will
   not have a clearly-defined location.

2. User-provided distfiles (e.g. for fetch-restricted packages) with
   hash mismatches would be placed in the wrong subdirectory,
   potentially causing confusing errors.

3. The hash values are unknown for newly-downloaded distfiles, so
   ``repoman`` (or an equivalent tool) would have to use a temporary
   directory before locating the file in appropriate subdirectory.

Using filename hashes has proven to provide a similar balance to using
file hashes.  Furthermore, since filenames are known up front this
solution does not suffer from the listed problems.  While hashes need
to be computed manually, hashing short string should not cause
any performance problems.

Jason Zaman has suggested to use package categories (and package names)
[#PKGNAME]_.  However, this solution has multiple problems:

a. it does not solve the problem for large packages such as TeΧ Live,

b. it introduces many unnecessarily small directories,

c. it requires an explicit knowledge of which package distfiles
   belong to,

d. it does not provide an explicit solution to the problem of distfiles
   shared by multiple packages,

e. it does not provide a solution to the problem of injected distfiles.

All the options considered, the filename hash solution was selected
as one that solves all the forementioned problems while introducing
relatively low complexity and being reasonably future-proof.

.. figure:: glep-0075-extras/by-filename.png

   Distribution of distfiles by first character of filenames
   (note: y axis is on log scale)

.. figure:: glep-0075-extras/by-csum.png

   Distribution of distfiles by first hex-digit of checksum
   (x — content checksum, + — filename checksum)

.. figure:: glep-0075-extras/by-csum2.png

   Distribution of distfiles by two first hex-digits of checksum
   (x — content checksum, + — filename checksum)


Cutoff values
-------------
The original draft allowed any cutoff values.  This was changed since
multiples of 4 are much easier to implement — they can be trivially cut
from hexadecimal representation of the hash value.  This representation
is commonly used by hash function implementations, including the Portage
utility functions, pkgcore utility functions (snakeoil) and ``b2sum``
utility from coreutils.


Layout file
-----------
The presence of control file has been suggested in the original
discussion.  Its main purpose is to let package managers cleanly handle
the migration and detect how to correctly query the mirrors throughout
it.  Furthermore, it makes future changes easier.

The format lines specifically mean to hardcode as little about
the actual algorithm as possible.  Therefore, we can easily change
the hash used or the exact split structure without having to update
the package managers or even provide a compatibility layout.

The file is also open for future extensions to provide additional mirror
metadata.  However, no clear use for that has been determined so far.


Hash algorithm
--------------
The hash algorithm support is fully deferred to the existing code
in the package managers that is required to handle Manifests.
In particular, it is recommended to reuse one of the hashes that are
used in Manifest entries at the time.  This avoids code duplication
and reuses an existing mechanism to handle hash upgrades.

During the discussion, it has been pointed that this particular use case
does not require a cryptographically strong hash and a faster algorithm
could be used instead.  However, given the short length of hashed
strings performance is not a problem, and speed does not justify
the resulting code duplication.

It has also been pointed out that e.g. the BLAKE2 hash family provides
the ability of creating arbitrary length hashes instead of truncating
the standard-length hash.  However, not all implementations of BLAKE2
support that and relying on it could reduce portability for no apparent
gain.


Backwards Compatibility
=======================
Mirror compatibility
--------------------
The mirrored files are propagated to other mirrors as opaque directory
structure.  Therefore, there are no backwards compatibility concerns
on the mirroring side.

Backwards compatibility with existing clients is detailed
in `migrating mirrors to the hashed structure`_ section.  Backwards
compatibility with the old clients will be provided by preserving
the flat structure during the transitional period.

The new clients will fetch the ``layout.conf`` file to avoid backwards
compatibility concerns in the future.  In case of hitting an old mirror,
the package manager will default to the ``flat`` structure.


Package manager storage compatibility
-------------------------------------
The exact means of preserving backwards compatibility in package manager
storage are left to the package manager authors.  However, it is
recommended that package managers continue to support the flat layout
even if it is no longer the default.  The package manager may either
continue to read files from this location or automatically move them
to an appropriate subdirectory.


Reference Implementation
========================
The support for this specification has been implemented in Portage,
as of version 2.3.77.  This includes both fetching distfiles,
and maintaining mirrors via ``emirrordist``.  The implementation
supports both listed layouts, with all hash functions supported
by Portage and cutoffs being multiples of 4.

As of 2019-10-18, the Gentoo Infrastructure team has successfully
deployed the ``filename-hash BLAKE2B 8`` layout on Gentoo mirrors.


References
==========
.. [#DESKTOP_FORMAT] Desktop Entry Specification: Basic format of the file
   (https://standards.freedesktop.org/desktop-entry-spec/latest/ar01s03.html)

.. [#GLEP74] GLEP 74: Full-tree verification using Manifest files:
   Checksum algorithms (informational)
   (https://www.gentoo.org/glep/glep-0074.html#checksum-algorithms-informational)

.. [#GLEP59] GLEP 59: Manifest2 hash policies and security implications
   (https://www.gentoo.org/glep/glep-0059.html)

.. [#BUG534528] Bug 534528 - distfiles should be sorted into subdirectories
   of DISTDIR
   (https://bugs.gentoo.org/534528)

.. [#ML1] Michał Górny.  "[pre-GLEP] Split distfile mirror directory
   structure".  gentoo-dev mailing list, 2018-01-26,
   Message-ID 1517009079.31015.3.camel\@gentoo.org
   (https://archives.gentoo.org/gentoo-dev/message/cfc4f8595df2edf9a25ba9ecae2463ba)

.. [#ADAPTIVE_FILENAME] Andrew Barchuk's reply on 'using character ranges
   for each directory computed in a way to have the files distributed evenly',
   gentoo-dev mailing list, 2018-01-28,
   Message-ID 1517172228.2114973.1251027256.0A9C8F3C\@webmail.messagingengine.com
   (https://archives.gentoo.org/gentoo-dev/message/611bdaa76be049c1d650e8995748e7b8)

.. [#PKGNAME] Jason Zaman's reply including 'using the same dir layout
   as the packages themselves', gentoo-dev mailing list, 2018-01-28,
   Message-ID 20180128070111.GA17078\@meriadoc.perfinion.com
   (https://archives.gentoo.org/gentoo-dev/message/f26ed870c3a6d4ecf69a821723642975)


Copyright
=========
This work is licensed under the Creative Commons Attribution-ShareAlike 3.0
Unported License. To view a copy of this license, visit
https://creativecommons.org/licenses/by-sa/3.0/.