Comp.os.research: Frequently answered questions [2/3: l/m 13 Aug 1996]
Section - [2.3.1] File sizes



From: Performance and workload studies

There is no such thing as an average file system.  Some file systems
have lots of little files; others have a few big files.  As a mental
model, however, the notion of an average file system is invaluable.

The following table breaks down file sizes and the amount of disk
space they consume.

   file size       #files  %files  %files   disk space  %space  %space
(max. bytes)                         cum.         (MB)            cum.
           0       147479     1.2     1.2          0.0     0.0     0.0
           1         3288     0.0     1.2          0.0     0.0     0.0
           2         5740     0.0     1.3          0.0     0.0     0.0
           4        10234     0.1     1.4          0.0     0.0     0.0
           8        21217     0.2     1.5          0.1     0.0     0.0
          16        67144     0.6     2.1          0.9     0.0     0.0
          32       231970     1.9     4.0          5.8     0.0     0.0
          64       282079     2.3     6.3         14.3     0.0     0.0
         128       278731     2.3     8.6         26.1     0.0     0.0
         256       512897     4.2    12.9         95.1     0.0     0.1
         512      1284617    10.6    23.5        566.7     0.2     0.3
        1024      1808526    14.9    38.4       1442.8     0.6     0.8
        2048      2397908    19.8    58.1       3554.1     1.4     2.2
        4096      1717869    14.2    72.3       4966.8     1.9     4.1
        8192      1144688     9.4    81.7       6646.6     2.6     6.7
       16384       865126     7.1    88.9      10114.5     3.9    10.6
       32768       574651     4.7    93.6      13420.4     5.2    15.8
       65536       348280     2.9    96.5      16162.6     6.2    22.0
      131072       194864     1.6    98.1      18079.7     7.0    29.0
      262144       112967     0.9    99.0      21055.8     8.1    37.1
      524288        58644     0.5    99.5      21523.9     8.3    45.4
     1048576        32286     0.3    99.8      23652.5     9.1    54.5
     2097152        16140     0.1    99.9      23230.4     9.0    63.5
     4194304         7221     0.1   100.0      20850.3     8.0    71.5
     8388608         2475     0.0   100.0      14042.0     5.4    77.0
    16777216          991     0.0   100.0      11378.8     4.4    81.3
    33554432          479     0.0   100.0      11456.1     4.4    85.8
    67108864          258     0.0   100.0      12555.9     4.8    90.6
   134217728           61     0.0   100.0       5633.3     2.2    92.8
   268435456           29     0.0   100.0       5649.2     2.2    95.0
   536870912           12     0.0   100.0       4419.1     1.7    96.7
  1073741824            7     0.0   100.0       5004.5     1.9    98.6
  2147483647            3     0.0   100.0       3620.8     1.4   100.0
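A table like the one above can be regenerated on a live system with a
short script.  The sketch below (Python; the bucketing and output
format are my own, not the survey's actual tooling) counts regular
files into power-of-two size buckets:

```python
import math
import os
from collections import Counter

def size_histogram(root):
    """Count files under `root` into power-of-two buckets, keyed by
    the bucket's maximum size in bytes (0, 1, 2, 4, 8, ...)."""
    buckets = Counter()
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            try:
                size = os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                continue  # file vanished or is unreadable; skip it
            # A file of `size` bytes lands in the smallest power of
            # two that is >= size; empty files get their own bucket.
            bucket = 0 if size == 0 else 1 << math.ceil(math.log2(size))
            buckets[bucket] += 1
    return buckets

if __name__ == "__main__":
    hist = size_histogram(".")
    total = sum(hist.values())
    for max_bytes in sorted(hist):
        n = hist[max_bytes]
        print(f"{max_bytes:>12} {n:>8} {100.0 * n / total:6.1f}")
```

Computing the %space columns as well would only require summing
st_size per bucket alongside the counts.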

A number of observations can be made:
  - the distribution is heavily skewed towards small files
  - but it has a very long tail
  - the average file size is 22k
  - pick a file at random: it is probably smaller than 2k
  - pick a byte at random: it is probably in a file larger than 512k
  - 89% of files take up 11% of the disk space
  - 11% of files take up 89% of the disk space
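The two `pick at random' observations fall straight out of the
cumulative columns: the bucket where a cumulative percentage first
passes 50% contains the median.  A quick check (Python, using a few
rows copied from the table above):

```python
# (max. bytes, cumulative %files, cumulative %space),
# copied from selected rows of the table above.
rows = [
    (1024,      38.4,  0.8),
    (2048,      58.1,  2.2),
    (524288,    99.5, 45.4),
    (1048576,   99.8, 54.5),
]

def median_bucket(column):
    """Smallest bucket whose cumulative percentage in `column`
    (1 = %files, 2 = %space) passes 50%."""
    for row in rows:
        if row[column] > 50.0:
            return row[0]

# A random *file* is probably small: the file-count median falls
# in the <=2k bucket.
print(median_bucket(1))   # -> 2048
# A random *byte* is probably in a big file: over half the bytes
# sit in files larger than 512k.
print(median_bucket(2))   # -> 1048576
```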

Such a heavily skewed distribution of file sizes suggests that, if one
were to design a file system from scratch, it might make sense to
employ radically different strategies for small and large files.
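One common shape such a split takes, sketched here as a toy in Python
(the 2 KB threshold and the class are illustrative inventions, not
any real file system's layout): store small files inline in their
metadata record, so they cost no data blocks at all, and give large
files a list of fixed-size blocks.

```python
INLINE_MAX = 2048   # illustrative cutoff: ~58% of the files in the
                    # table above would fit entirely inline

class ToyFS:
    """Toy two-strategy store: small files live inline in the
    metadata record; larger files are split across data blocks."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.files = {}

    def write(self, name, data):
        if len(data) <= INLINE_MAX:
            self.files[name] = ("inline", data)       # zero data blocks
        else:
            blocks = [data[i:i + self.block_size]
                      for i in range(0, len(data), self.block_size)]
            self.files[name] = ("blocks", blocks)

    def read(self, name):
        kind, payload = self.files[name]
        return payload if kind == "inline" else b"".join(payload)
```

Several real file systems do something of this shape -- storing small
files (or file tails) inside the inode while mapping large files
through extents -- the point being that the threshold lets the common
small-file case avoid per-block overhead entirely.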

The seductive power of mathematics allows us to treat a 200-byte file
and a 2 MB file in the same way.  But do we really want to?  Are
there any problems in engineering where the same techniques would be
used to handle physical objects that span four orders of magnitude?

A quote from sci.physics that has stuck with me: `When things change
by 2 orders of magnitude, you are actually dealing with fundamentally
different problems'.

People I trust say they would have expected the tail of the above
distribution to have been even longer.  There are at least some files
in the 1-2G range.  They point out that DBMS shops with really large
files might have been less inclined to respond to a survey like this
than some other sites.  This would bias the disk space figures, but it
would have no appreciable effect on file counts.  The results gathered
would still be valuable because many static disk layout issues are
determined by the distribution of small files and are largely
independent of the potential existence of massive files.

(It should be noted that many popular DBMSs, such as Oracle, Sybase,
 and Informix, use raw disk partitions instead of Unix file systems
 for storing data, hence the difficulty in gathering data about them
 in a uniform way.)


Send corrections/additions to the FAQ Maintainer:
os-faq@cse.ucsc.edu





Last Update March 27 2014 @ 02:12 PM