Archive-name: os-research/part2
Version: $Revision: 1.22 $
Posting-Frequency: monthly
Last-Modified: Tue Aug 13 21:03:28 1996
URL: http://www.serpentine.com/~bos/os-faq/
Answers to frequently asked questions
for comp.os.research: part 2 of 3
Copyright (C) 1993--1996
Bryan O'Sullivan
TABLE OF CONTENTS
1. Available software
1.1. Where can I find Unix process checkpointing and restoration packages?
1.2. What threads packages are available for me to use?
1.3. Can I use distributed shared memory on my Unix system?
1.4. Where can I find operating systems distributions?
1.4.1. Distributed systems and microkernels
1.4.2. Unix lookalikes
1.4.3. Others
2. Performance and workload studies
2.1. TCP internetwork traffic characteristics
2.2. File system traces
2.3. Modern Unix file and block sizes
2.3.1. File sizes
2.3.2. Block sizes
2.3.3. Inode ratios
3. Papers, reports, and bibliographies
3.1. From where are papers for distributed systems available?
3.2. Where can I find other papers?
3.3. Where can I find bibliographies?
4. General Internet-accessible resources
4.1. Wide Area Information Service (WAIS) and World-Wide Web (WWW) servers
4.2. Refdbms---a distributed bibliographic database system
4.3. Willow -- the information looker-upper
4.4. Computer science bibliographies and technical reports
4.5. The comp.os.research archive
4.6. Miscellaneous resources
5. Disclaimer and copyright
Subject: [1] Available software
From: Available software
This section covers various software packages, operating systems
distributions, and miscellaneous other such items which may be of
interest to the operating systems research community. If you have
written, or know of, some software which you believe would be of
fairly wide interest, please get in touch with the FAQ maintainer with
a view to having a short spiel and availability information included
here.
Subject: [1.1] Where can I find Unix process checkpointing and restoration packages?
From: Available software
- [93-01-21-10-18.30] The Condor system is available via anonymous ftp
from <URL:ftp://ftp.cs.wisc.edu>. Condor works entirely at user
level [no kernel modifications required] but doesn't currently
support interprocess communication, signals, or fork(). Definitely
worth a look.
- Bennet S Yee implemented a `mostly portable' checkpoint and restore
package back around 1987. When the programmer invokes the
checkpoint procedure, it saves the state to a file; when a second
process with the same program (but with different arguments) is
started which calls the restore procedure, it reads the old state
from the file. Available via anonymous ftp from
<URL:ftp://play.trust.cs.cmu.edu/usr/bsy/pub/>.
This package is known to work for Pmaxen, Sun4's, Sun3's, IBM RTs,
and VAXen. Porting it to a new architecture should be relatively
simple -- look at the README file.
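The save-then-restore pattern described above can be sketched in a few lines. This is only an illustration of the idea, not Bennet Yee's package or Condor: it saves an explicit Python dictionary with pickle rather than raw process state, and the names checkpoint, restore, and count_to are hypothetical.

```python
import os
import pickle


def checkpoint(state, path):
    """Save the state to a file, replacing any older checkpoint atomically."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)


def restore(path):
    """Read the old state back from the checkpoint file."""
    with open(path, "rb") as f:
        return pickle.load(f)


def count_to(limit, path, stop_after=None):
    """Sum 0..limit-1, checkpointing after every step.

    If a checkpoint file already exists, resume from it. stop_after
    simulates the process being killed after that many steps.
    """
    if os.path.exists(path):
        state = restore(path)
    else:
        state = {"i": 0, "total": 0}
    steps = 0
    while state["i"] < limit:
        if stop_after is not None and steps == stop_after:
            return None  # pretend the process died here
        state["total"] += state["i"]
        state["i"] += 1
        checkpoint(state, path)
        steps += 1
    return state["total"]
```

A first call with stop_after set leaves a checkpoint behind; a second call with the same path picks up where the first one stopped.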
Subject: [1.2] What threads packages are available for me to use?
From: Available software
Now that POSIX has arrived at a standard threads interface, it is
expected that all major Unix vendors will soon release conformant
threads packages. Currently, vendor-supplied threads packages vary
widely in the interfaces they provide. Some vendors' packages conform
to various drafts of the POSIX standard, while others provide their
own interfaces.
OS/2, Windows NT and Windows 95 all provide threads interfaces. None
conforms to the POSIX standard, and neither IBM nor Microsoft has
signalled any intention to provide conformant threads interfaces.
- Michael T. Peterson <mtp@big.aa.net> has written a POSIX and DCE
threads package, called PCthreads, for Intel-based Linux systems.
See <URL:http://www.aa.net/~mtp/PCthreads.html> for more information.
- Christopher Provenzano <proven@mit.edu> has written a portable
implementation of draft 8 of the IEEE Pthreads standard. See
<URL:http://www.mit.edu:8001/people/proven/pthreads.html> for further
details, or fetch the software itself from
<URL:ftp://sipb.mit.edu/pub/pthreads>. Currently supported are
i386/i486/Pentium processors running NetBSD 1.0, FreeBSD 1.1, Linux
1.0, and BSDi 1.1; DECstations running Ultrix-4.2; SPARCstations
running SunOS 4.1.3; and HP/PA machines running HP/UX-9.03.
As far as I can see, development of this library has halted (at
least temporarily), and it still contains many serious bugs.
- Georgia Tech's OS group has a fairly portable user-level threads
implementation of the Mach Cthreads package. It is called Cthreads,
and can be found at
<URL:ftp://ftp.cc.gatech.edu/pub/groups/systems/Falcon/>.
It also contains the Falcon integrated monitoring system.
It currently runs under SunOS 4.1.X, Irix 4.0.5, Irix 5.3, AIX
3.2.5, Linux 1.0 and higher, and KSR1 and KSR2. It is fairly easy
to port to other architectures. Current ports in progress are
Solaris 2.4 and AIX 4.X.
- The POSIX / Ada-Runtime Project (PART) has made available an
implementation of draft 6 of the POSIX 1003.4a Pthreads
specification, which runs under SunOS 4.x; the current release is
version 1.20. Available using anonymous ftp from
<URL:ftp://ftp.cs.fsu.edu/pub/>.
- Elan Feingold has written a threads package called ethreads; I don't
know anything about it, other than that it is available from
<URL:ftp://frmap711.mathp7.jussieu.fr/pub/scratch/rideau/misc/threads/ethreads/ethreads.tgz>.
- Stephen Crane has written a `fairly portable' threads package, which
runs under Sun 3, Sun 4, MIPS/RISCos, Linux, and 386BSD. It is
available via anonymous ftp from
<URL:ftp://dse.doc.ic.ac.uk/rex/>, with documentation in
the same directory named lwp.ps.gz.
- QuickThreads is a toolkit for building threads packages, written by
David Keppel. It is available via anonymous ftp from
<URL:ftp://ftp.cs.washington.edu/pub/qt-001.tar.Z>, with an
accompanying tech report at
<URL:ftp://ftp.cs.washington.edu/tr/1993/05/UW-CSE-93-05-06.PS.Z>.
The code as distributed includes ports for the Alpha, x86, 88000,
MIPS, SPARC, VAX, and KSR1.
- On CONVEX SPP Exemplar machines there is a Compiler Parallel Support
Library (CPSlib), a library of thread management and synchronisation
routines. CPSlib is not compatible with anything else, but the
interface is sufficiently similar to the Solaris threads or pthreads
interface to allow straight porting. One special feature of CPSlib
is the (possible) distinction between "symmetric" and "asymmetric"
parallelism.
A small number of vendors provide DCE threads packages for various
Unix systems.
Subject: [1.3] Can I use distributed shared memory on my Unix system?
From: Available software
- CRL is a simple all-software distributed shared memory system
intended for use on message-passing multicomputers and distributed
systems. CRL 1.0 can be compiled for use on the MIT Alewife
Machine, Thinking Machines' CM-5, and networks of Sun workstations
running SunOS 4.1.3 communicating with one another using TCP and
PVM. Because CRL requires no functionality from the underlying
hardware, compiler, or operating system beyond that necessary to
send and receive messages, porting CRL to other platforms should
prove to be straightforward.
General information about CRL can be found at
<URL:http://www.pdos.lcs.mit.edu/crl>. The CRL 1.0 source
distribution (sources for CRL 1.0 and several applications, user
documentation, and a PostScript version of a paper about CRL to
appear at SOSP later this year) is available at
<URL:http://www.pdos.lcs.mit.edu/crl/source.html>.
- Ron Minnich <rminnich@earth.sarnoff.com> has implemented a
distributed shared memory system called MNFS, which is a modified
version of NFS and runs alongside NFS in the kernel.
Performance is good; page faults under FreeBSD 2.0R run at about the
same speed as NFS (~5.9 milliseconds per page). If you need to
update a page from one host to many clients, it can be done at a
cost of 1.2 milliseconds or so per client. This scales: networks of
128 nodes running MNFS have been set up, and times should improve
over faster LANs than Ethernet.
The MNFS programming model uses mmap'ed files. Programs map files
in and then use them as ordinary memory. Cache consistency of a
page is maintained by the MNFS servers, ensuring that there is only
one writeable copy in the network at a time. The model is not
strongly coherent; read-only copies of a page are only refreshed by
an explicit action on the part of the holder of a writeable page
(using msync). For those who don't like this style of programming,
a parallel C compiler has been retargeted to use MNFS on clusters
and networks of computers running Condor. Both performance and
scalability matched explicitly mmap-coded systems.
The system has been implemented on SunOS 4.1.x, Solaris 2.2 and 2.3,
IRIX 5.2 and 5.3, and AIX 3.2. All of these were legally
encumbered, so the FreeBSD version is currently the only
freely-available implementation.
MNFS is available from <URL:ftp://ftp.sarnoff.com/pub/mnfs>, and may be
installed either as a set of diffs to the FreeBSD 2.0.5R kernel, or
installed in-place. Also included in this directory is a slightly
out-of-date paper on MNFS, and a more current manual.
A Linux port of MNFS is in the works.
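The mmap-and-msync style of programming that MNFS uses can be sketched as follows. This is a minimal illustration on an ordinary local temporary file; with MNFS the file would presumably live on an MNFS mount, which is not shown here. The helper names write_shared, read_shared, and demo are hypothetical. Python's mmap.flush() is the counterpart of msync.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE


def write_shared(path, offset, data):
    """Map the file, store into it like ordinary memory, then flush."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as m:
            m[offset:offset + len(data)] = data
            m.flush()  # msync: push the modified pages out explicitly


def read_shared(path, offset, length):
    """Map the file read-only and copy bytes out of the mapping."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return bytes(m[offset:offset + length])


def demo():
    """Write through one mapping and read back through another."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"\0" * PAGE)  # the file needs a size before mapping
        os.close(fd)
        write_shared(path, 0, b"hello")
        return read_shared(path, 0, 5)
    finally:
        os.unlink(path)
```

The explicit flush is the point of the sketch: in the model described above, holders of read-only copies see new data only after the writer takes that step.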
Subject: [1.4] Where can I find operating systems distributions?
From: Available software
This section covers the availability of several well-known systems;
the only criterion for inclusion of a system here is that it be of
interest to some segment of the OS research community (commercial
systems will be accepted for inclusion, so long as they are pertinent
to research).
Subject: [1.4.1] Distributed systems and microkernels
From: Available software
See part one of the FAQ for further information on some of the systems
listed below.
- [93-03-31-22-49.53] ACE is the distribution, support and sales
channel for Amoeba. `Due to overwhelming response from non-profit
organisations wishing to obtain Amoeba for their research
activities', VU is offering Amoeba 5.2 to research institutions for
more or less free (via ftp at no charge, or on tape for $500 on
Exabyte or $800 on QIC-24). Amoeba currently supports 68020 and
68030-based VME board machines, as well as i386- and i486-based AT
PCs and Sun 3 and 4 machines.
For further information on `commercial' Amoeba, you can contact ACE
by email at <amoeba@ace.nl>, by phone at +31 20 664 6416, or by
fax at +31 20 675 0389. Universities interested in obtaining a
license should send mail to <amoeba-license@cs.vu.nl>, or fax to
+31 20 642 7705.
- Chorus Systemes has special programmes for universities interested
in using Chorus. For more information on the offerings available,
conditions, and other details, get the following files:
- <URL:ftp://ftp.chorus.fr/pub/>
- <URL:ftp://ftp.chorus.fr/pub/academic/README>
- <URL:ftp://ftp.chorus.fr/pub/academic/offerings>
- The Cronus object-oriented distributed system may be obtained via
ftp from <URL:ftp://pineapple.bbn.com>; email
<cronus-help@bbn.com> for details of the account name and
password. Before attempting to get the Cronus distribution, you
must obtain, via anonymous ftp,
<URL:ftp://pineapple.bbn.com/>. Maintenance,
hotline support, and training for Cronus are available from BBN.
Send email to the above address for information on these, or on
obtaining a commercial license.
- Flux is a Mach-based toolkit for developing operating systems; you
can find more information about it on the Web at
<URL:http://www.cs.utah.edu/projects/flux>.
- Horus is available for research use; contact Ken Birman
<ken@cs.cornell.edu> or Robbert van Renesse
<rvr@cs.cornell.edu> for details.
- Isis has not been publicly available since 1989, but may (I'm not
sure) still be obtained using anonymous ftp from
<URL:ftp://ftp.uu.net> or <URL:ftp://ftp.cs.cornell.edu>. After 1989,
the code was picked up by Isis Distributed Systems, which has
subsequently developed and supported it. The commercial version of
Isis (available `at very low cost' to academic institutions) is
available from the company. Email <info@isis.com> for
information, or call +1-212-979-7729 or +1-607-272-6327.
- Information on obtaining the latest Mach 4 distribution is available
from the University of Utah's Mach 4 pages, at
<URL:http://www.cs.utah.edu/projects/flux/mach4/html/Mach4-proj.html>.
- The Plan 9 distribution is now commercially available for $350; it
consists of a two-volume manual, a CD-ROM with all the sources, and
four PC diskettes comprising a binary-only installation of a fairly
complete version of the system that runs on a PC. For more
information, see <URL:http://plan9.att.com/plan9/index.html>; this site
houses ordering information, a browsable copy of all the
documentation, and the PC binary distribution.
Kernels exist for the Sun SLC, Sun4Cs of various types,
NeXTstations, MIPS Magnum 3000, SGI 4D series, AT&T Safari, `a whole
bunch of' PCs, and the Gnot.
Sydney University Basser Department of Computer Science has a port
of Plan 9 underway to the DEC Alpha at the moment. A port to the
Sun 3 has been completed. Contact <plan9info@cs.su.oz.au> for
details.
The Plan 9 user mailing list may be subscribed to by sending mail to
<9fans-request@cse.psu.edu>.
- QNX is available for academic applications through an education
support programme run by QNX Software Systems, whereby QNX systems
can be obtained for educational purposes at very low cost. For
commercial and education availability and pricing, contact:
QNX Software Systems        QNX Software Systems
175 Terrence Matthews Cr.   Westendstr. 19
Kanata, Ontario K2M 1W8     6000 Frankfurt am Main 1
Canada                      Germany
1 800 363 9001              +49 69 9754 6156 x299
+1 (613) 591 0931
+1 (613) 591 3579 (fax)     +49 69 9754 6110 (fax)
Versions after 4.2 of QNX run on the i386 and later processors, with
a 16-bit kernel included for i286 machines. Native optimisations
and a compiler for the Pentium are also included. Further marketing
information can be obtained on the World Wide Web from
<URL:http://www.qnx.com>.
- The 1.1 Research Distribution of the Spring distributed object
oriented operating system is available. Spring is a highly modular,
object-oriented operating system, which is focused around a uniform
interface definition language (IDL). The system is intrinsically
distributed, with all system interfaces being accessible both
locally and remotely.
The 1.1 Research Distribution adds a number of fixes and
improvements, including a Spring-Java IDL system that facilitates
writing Java applets that can talk across Spring IDL interfaces.
The Spring SRD 1.1 Binary CDROM is $75 to Universities and $750 to
commercial research institutions. This includes all of the software
and documentation necessary for installing, running, and developing
new system modules and applications in Spring. All binaries, IDL
files, development tools, key exemplary sources, and course teaching
materials are included. A standard full source license and source
CDROM is also available for $100 to Universities and $1000 to
commercial research institutions.
For more details and ordering information, see
<URL:http://www.sun.com/tech/projects/spring>.
- [93-02-07-16-03.48] The Sprite Network Operating System is available
on CD-ROM. The disc contains the source code and documentation for
Sprite, a research operating system developed at the University of
California, Berkeley. All the research papers from the Sprite
project are also included on the disc. The software on this disc
is primarily intended for research purposes, and is not really
intended to be used as a production system. Boot images are
provided for Sun SPARCstations and DECstations. The CD-ROM is in
ISO-9660 format with Rock Ridge extensions. The disc contains about
550 megabytes of software.
You can get an overview of the Sprite Project, and a complete list
of what is on this disc, by anonymous ftp from
<URL:ftp://cdrom.com/pub/cdroms/>.
If you would like a CD-ROM please send $25. Add $4.95 if you would
like a caddy too. S&H is $5 (per order, not per disc) for
US/Can/Mex, and $10 for overseas. If you live in California, please
add sales tax. You can send a check or money order, or you can
order with Mastercard/Visa/AmEx.
Bob Bruce <rab@cdrom.com>
Walnut Creek CDROM
1547 Palos Verdes Mall, Suite 260
Walnut Creek, CA 94596
United States
1 800 786-9907 (USA only)
+1 510 947-5996
+1 510 947-1644 (fax)
- VSTa is a copylefted system written by Andrew Valencia
<vandys@cisco.com> which uses ideas from several research
operating systems in its implementation. It is currently in an
`experimental but usable' state, and supports `lots of' POSIX, and
runs on a number of different PC configurations. For further
information, send mail to <vsta-request@cisco.com>, or ftp to
<URL:ftp://ftp.cygnus.com/pub/embedded/vsta>.
[Chorus, Clouds?, Choices?]
Subject: [1.4.2] Unix lookalikes
From: Available software
- FreeBSD is available via ftp from
<URL:ftp://ftp.freebsd.org/pub/>,
<URL:ftp://ftp.cosy.sbg.ac.at/pub/mirror/>, and
<URL:ftp://pdq.coe.montana.edu/pub/mirrors/unix/>. The latest
version is derived from 4.4BSD Lite, and contains many extensions.
See <URL:http://www.freebsd.org> for further information.
- NetBSD is available via ftp from
<URL:ftp://ftp.netbsd.org/pub/NetBSD>, and is also derived from
4.4BSD Lite. See <URL:http://www.netbsd.org> for more information.
- Linux is available via anonymous ftp from
<URL:ftp://tsx-11.mit.edu/pub/>, <URL:ftp://ftp.funet.fi/pub/OS/Linux>,
and <URL:ftp://sunsite.unc.edu/pub/Linux>. It is a freely-distributable
System V compatible Unix, and is covered by the GNU General Public
License. Linux runs on almost all PCs with i386 or better CPUs and at
least 4 megabytes of memory. See <URL:http://www.linux.org> for further
details.
- 386BSD is available via ftp from
<URL:ftp://agate.berkeley.edu/pub/> or
<URL:ftp://ftp.uu.net:systems/unix/>. It lies mid-way between
4.3BSD Reno and 4.4BSD internally, and contains no AT&T-copyrighted
code. 386BSD runs on ISA bus PCs with i386 or better CPUs. Use of
386BSD is not recommended, since it is unstable and has long since
been superseded by FreeBSD and NetBSD.
- The Hurd is the GNU operating system, being written by Michael
Bushnell. It is based on Mach 3.0, and should be available on most
systems to which Mach has been ported. A preliminary runnable image
may be fetched from
<URL:ftp://alpha.gnu.ai.mit.edu/gnu/>. Trent
A. Fisher <trent@gnurd.uu.pdx.edu> runs an unofficial Hurd page
at <URL:http://www.cs.pdx.edu/~trent/gnu/hurd.html>.
- Lites is a free 4.4BSD-based Unix server which runs on top of Mach.
Lites provides binary compatibility with 4.4 BSD, NetBSD (0.8, 0.9,
and 1.0), FreeBSD (1.1.5 and 2.0), 386BSD, UX (4.3BSD) and Linux on
the i386 platform. It has also been ported to the pc532, and
PA-RISC. Preliminary ports to the R3000 and Alpha processors have
also been made. For more information, see the Lites home page at
<URL:http://www.cs.hut.fi/lites.html>, and see also
<URL:http://www.cs.utah.edu/projects/flux/lites/html>.
Subject: [1.4.3] Others
From: Available software
[93-03-18-10-19.02] Microsoft is making sources of Windows NT
available under license to universities and research laboratories.
You should have the appropriate officials contact
<ntsrcreq@microsoft.com> to get started on this process.
Patrick Bridges' operating systems home page at
<URL:http://www.cs.arizona.edu/people/bridges/oses.html> is an
excellent source of information on a variety of other operating
systems.
Subject: [2] Performance and workload studies
From: Performance and workload studies
This section covers various different publicly-available traces and
studies, libraries and source distributions, which may be of use.
Subject: [2.1] TCP internetwork traffic characteristics
From: Performance and workload studies
- The Internet Traffic Archive is a moderated repository to support
widespread access to traces of Internet network traffic. The traces
can be used to study network dynamics, usage characteristics, and
growth patterns, as well as providing the grist for trace-driven
simulations. The archive is also open to programs for reducing raw
trace data to more manageable forms, for generating synthetic
traces, and for analyzing traces. The archive is available on the
Web at <URL:http://town.hall.org/Archives/pub/ITA>.
There you will find a description of the archive, its associated
mailing lists, the moderation policy and submission guidelines, and
the contents of the archive (traces and programs).
- [92-10-20-15-04.39] Peter Danzig and Sugih Jamin of USC have made
available a report and a source library which simulates realistic
day-to-day network traffic between nodes. The library, tcplib, `is
motivated by our observation that present-day wide-area tcp/ip
traffic cannot be accurately modeled with simple analytical
expressions, but instead requires a combination of detailed
knowledge of the end-user applications responsible for the traffic
and certain measured probability distributions'.
The technical report and the source library it describes are
available via anonymous ftp from
<URL:ftp://jerico.usc.edu/pub/jamin/tcplib>. All you need to
transfer to use the library are: README, brkdn_dist.h, tcpapps.h,
tcplib.1, and one of libtcp* that matches your setup. You need
tcplib.tar.Z only if you must generate the library yourself. The
file tcplibtr.ps.Z is the PostScript version of the report. The
authors may be contacted at <traffic@excalibur.usc.edu>.
- [93-08-09-15-15.54] Vern Paxson of Lawrence Berkeley Laboratories
has a report available via anonymous ftp which describes analytic
models for wide-area TCP connections based upon a set of wide-area
traffic traces. The report may be obtained from
<URL:ftp://ftp.ee.lbl.gov/WAN-TCP-models.{1,2}.ps.Z>.
- [93-05-13-10-54.09] Vern Paxson also has made available another
report, <URL:ftp://ftp.ee.lbl.gov/WAN-TCP-growth-trends.ps.Z>, which
provides an analysis of the growth trends of a medium-sized research
laboratory's wide-area TCP connections over a period of more than
two years.
Subject: [2.2] File system traces
From: Performance and workload studies
- Randy Appleton <randy@dcs.uky.edu> has a set of filesystem traces
which detail every operation performed during a period of more than
a week (several hundred thousand events). Timestamps on the traces
are accurate to under a millisecond. For more details, contact the
author, or visit <URL:http://www.dcs.uky.edu/~randy/Research/index.html>.
- Chris Ruemmler has done a study on low-level disk access patterns
for a workstation, a server, and a time-shared system which appeared
in the Winter 1993 USENIX proceedings. A copy may be obtained via
anonymous ftp from <URL:ftp://ftp.hpl.hp.com/wilkes/>.
- Stephen Russell <smr@cs.unsw.oz.au> has instrumented the SunOS 4.1.x
kernel running on Sun 3 machines. The system allows time-stamped
event records to be obtained from various points in the kernel.
Events can be categorised (eg, paging, file system, etc), and are
read via pseudo-devices. Ioctl calls allow substreams to be
enabled/disabled, buffer status checked, etc. An external high
resolution timer is used for timestamping.
- [93-05-09-09-23.32] The traces used in `Measurements of a
distributed file system' (SOSP 1991) may be obtained from
<URL:http://now.cs.berkeley.edu/Xfs/SpriteTraces>.
Subject: [2.3] Modern Unix file and block sizes
From: Performance and workload studies
The following sections are lifted more or less verbatim from a number
of traces which were co-ordinated and analysed by Gordon Irlam
<gordoni@home.base.com>. The numbers quoted below are based on Unix
file size data for 12 million files, residing on 1000 file systems,
with a total size of 250 gigabytes.
Further information may be obtained on the World Wide Web at
<URL:http://www.base.com/gordoni/ufs93.html>.
Subject: [2.3.1] File sizes
From: Performance and workload studies
There is no such thing as an average file system. Some file systems
have lots of little files. Others have a few big files. However as a
mental model the notion of an average file system is invaluable.
The following table gives a breakdown of file sizes and the amount of
space they consume.
   file size   #files  %files  %files  disk space  %space  %space
(max. bytes)                    cumm.        (Mb)           cumm.
           0   147479     1.2     1.2         0.0     0.0     0.0
           1     3288     0.0     1.2         0.0     0.0     0.0
           2     5740     0.0     1.3         0.0     0.0     0.0
           4    10234     0.1     1.4         0.0     0.0     0.0
           8    21217     0.2     1.5         0.1     0.0     0.0
          16    67144     0.6     2.1         0.9     0.0     0.0
          32   231970     1.9     4.0         5.8     0.0     0.0
          64   282079     2.3     6.3        14.3     0.0     0.0
         128   278731     2.3     8.6        26.1     0.0     0.0
         256   512897     4.2    12.9        95.1     0.0     0.1
         512  1284617    10.6    23.5       566.7     0.2     0.3
        1024  1808526    14.9    38.4      1442.8     0.6     0.8
        2048  2397908    19.8    58.1      3554.1     1.4     2.2
        4096  1717869    14.2    72.3      4966.8     1.9     4.1
        8192  1144688     9.4    81.7      6646.6     2.6     6.7
       16384   865126     7.1    88.9     10114.5     3.9    10.6
       32768   574651     4.7    93.6     13420.4     5.2    15.8
       65536   348280     2.9    96.5     16162.6     6.2    22.0
      131072   194864     1.6    98.1     18079.7     7.0    29.0
      262144   112967     0.9    99.0     21055.8     8.1    37.1
      524288    58644     0.5    99.5     21523.9     8.3    45.4
     1048576    32286     0.3    99.8     23652.5     9.1    54.5
     2097152    16140     0.1    99.9     23230.4     9.0    63.5
     4194304     7221     0.1   100.0     20850.3     8.0    71.5
     8388608     2475     0.0   100.0     14042.0     5.4    77.0
    16777216      991     0.0   100.0     11378.8     4.4    81.3
    33554432      479     0.0   100.0     11456.1     4.4    85.8
    67108864      258     0.0   100.0     12555.9     4.8    90.6
   134217728       61     0.0   100.0      5633.3     2.2    92.8
   268435456       29     0.0   100.0      5649.2     2.2    95.0
   536870912       12     0.0   100.0      4419.1     1.7    96.7
  1073741824        7     0.0   100.0      5004.5     1.9    98.6
  2147483647        3     0.0   100.0      3620.8     1.4   100.0
A number of observations can be made:
- the distribution is heavily skewed towards small files
- but it has a very long tail
- the average file size is 22k
- pick a file at random: it is probably smaller than 2k
- pick a byte at random: it is probably in a file larger than 512k
- 89% of files take up 11% of the disk space
- 11% of files take up 89% of the disk space
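Several of these observations can be rechecked directly from the table. The sketch below is not part of the original study: it transcribes the file-count and disk-space columns and assumes that `Mb' means 1024 kilobytes. The function names are hypothetical.

```python
# (upper bound in bytes, number of files, disk space in Mb),
# transcribed from the table above.
ROWS = [
    (0, 147479, 0.0), (1, 3288, 0.0), (2, 5740, 0.0), (4, 10234, 0.0),
    (8, 21217, 0.1), (16, 67144, 0.9), (32, 231970, 5.8),
    (64, 282079, 14.3), (128, 278731, 26.1), (256, 512897, 95.1),
    (512, 1284617, 566.7), (1024, 1808526, 1442.8),
    (2048, 2397908, 3554.1), (4096, 1717869, 4966.8),
    (8192, 1144688, 6646.6), (16384, 865126, 10114.5),
    (32768, 574651, 13420.4), (65536, 348280, 16162.6),
    (131072, 194864, 18079.7), (262144, 112967, 21055.8),
    (524288, 58644, 21523.9), (1048576, 32286, 23652.5),
    (2097152, 16140, 23230.4), (4194304, 7221, 20850.3),
    (8388608, 2475, 14042.0), (16777216, 991, 11378.8),
    (33554432, 479, 11456.1), (67108864, 258, 12555.9),
    (134217728, 61, 5633.3), (268435456, 29, 5649.2),
    (536870912, 12, 4419.1), (1073741824, 7, 5004.5),
    (2147483647, 3, 3620.8),
]


def median_bucket(rows, column):
    """Size bound of the first bucket at which the running total of
    the given column (1 = files, 2 = Mb) reaches half the grand total."""
    total = sum(row[column] for row in rows)
    running = 0
    for row in rows:
        running += row[column]
        if running >= total / 2:
            return row[0]


def average_file_size_kb(rows):
    """Mean file size in kilobytes, assuming 1 Mb = 1024 kilobytes."""
    files = sum(row[1] for row in rows)
    megabytes = sum(row[2] for row in rows)
    return megabytes * 1024 / files
```

Calling median_bucket(ROWS, 1) lands in the 2048-byte bucket (a random file is probably smaller than 2k), median_bucket(ROWS, 2) lands in the 1048576-byte bucket (a random byte is probably in a file larger than 512k), and average_file_size_kb(ROWS) comes out close to the 22k quoted above.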
Such a heavily skewed distribution of file sizes suggests that, if one
were to design a file system from scratch, it might make sense to
employ radically different strategies for small and large files.
The seductive power of mathematics allows us to treat a 200 byte and a
2MB file in the same way. But do we really want to? Are there any
problems in engineering where the same techniques would be used in
handling physical objects that span 6 orders of magnitude?
A quote from sci.physics that has stuck with me: `When things change
by 2 orders of magnitude, you are actually dealing with fundamentally
different problems'.
People I trust say they would have expected the tail of the above
distribution to have been even longer. There are at least some files
in the 1-2G range. They point out that DBMS shops with really large
files might have been less inclined to respond to a survey like this
than some other sites. This would bias the disk space figures, but it
would have no appreciable effect on file counts. The results gathered
would still be valuable because many static disk layout issues are
determined by the distribution of small files and are largely
independent of the potential existence of massive files.
(It should be noted that many popular DBMSs, such as Oracle, Sybase,
and Informix, use raw disk partitions instead of Unix file systems
for storing data, hence the difficulty in gathering data about them
in a uniform way.)
Subject: [2.3.2] Block sizes
From: Performance and workload studies
The last block of a file is normally only partially occupied, and so
as block sizes are increased so too will the amount of wasted disk
space.
The following historical values for the design of the BSD FFS are
given in `Design and implementation of the 4.3BSD Unix operating
system':
fragment size    overhead
      (bytes)         (%)
          512         4.2
         1024         9.1
         2048        19.7
         4096        42.9
Files have clearly gotten larger since then; I obtained the following
results:
fragment size    overhead
      (bytes)         (%)
          128         0.3
          256         0.6
          512         1.1
         1024         2.5
         2048         5.4
         4096        12.3
         8192        27.8
        16384        61.2
By default the BSD FFS typically uses a 1k fragment size. Perhaps
this size is no longer optimal and should be increased.
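Overhead of this kind comes from rounding each file up to a whole number of fragments. A minimal sketch of the calculation follows, assuming that overhead means wasted space as a percentage of the space the files actually use (the study may define it differently). The function name is hypothetical.

```python
def fragment_overhead(file_sizes, fragment_size):
    """Wasted space, as a percentage of used space, when every file
    is rounded up to a whole number of fragments."""
    used = sum(file_sizes)
    if used == 0:
        raise ValueError("need at least one non-empty file")
    allocated = sum(
        -(-size // fragment_size) * fragment_size  # ceiling division
        for size in file_sizes
    )
    return 100.0 * (allocated - used) / used
```

For any fixed set of files, doubling the fragment size can only keep the overhead the same or raise it, which is the trend both tables show.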
(The FFS block size is constrained to be no more than 8 times the
fragment size. Clustering is a good way to improve throughput for
FFS based file systems, but it doesn't do very much to reduce the not
insignificant FFS computational overhead.)
It is interesting to note that even though most files are less than 2K
in size, having a 2K block size wastes very little space, because disk
space consumption is so totally dominated by large files.
Subject: [2.3.3] Inode ratios
From: Performance and workload studies
The BSD FFS statically allocates inodes. By default one inode is
allocated for every 2K of disk space. Since an inode consumes 128
bytes this means that by default 6.25% of disk space is consumed by
inodes.
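The 6.25% figure is just the inode size divided by the bytes-per-inode ratio. A small sketch, with a hypothetical function name, makes the arithmetic explicit:

```python
def inode_space_percent(bytes_per_inode, inode_size=128):
    """Percentage of disk space taken by statically allocated inodes,
    for one inode of inode_size bytes per bytes_per_inode of disk."""
    return 100.0 * inode_size / bytes_per_inode
```

With the default of one 128-byte inode per 2K, inode_space_percent(2048) gives 6.25; halving the ratio to 1K doubles it to 12.5. This covers only the space taken by the inodes themselves, not space lost when a file system runs out of inodes.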
It is important not to run out of inodes since any remaining disk
space is then effectively wasted. Despite this, allocating 1 inode for
every 2K is excessive.
For each file system studied I worked out the minimum sized disk it
could be placed on. Most disks needed to be only marginally larger
than the size of their files, but a few disks, having much smaller
files than average, needed a much larger disk---a small disk had
insufficient inodes.
bytes per    overhead
    inode         (%)
     1024        12.5
     2048         6.3
     3072         4.5
     4096         4.2
     5120         4.4
     6144         4.9
     7168         5.5
     8192         6.3
     9216         7.2
    10240         8.3
    11264         9.5
    12288        10.9
    13312        12.7
    14336        14.6
    15360        16.7
    16384        19.1
    17408        21.7
    18432        24.4
    19456        27.4
    20480        30.5
Clearly, the current default of one inode for every 2K of data is too
small. Earlier results suggested that allocating one inode for every
5-6k was in some sense optimal, and allocating one inode for every 8k
would only be 0.4% worse. The new data suggests one inode for every
4k is optimal, and allocating one inode for every 8k would be 2.1%
worse.
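The `optimal' ratio and the 2.1% penalty can be read straight off the table. The sketch below simply transcribes the table and looks for its minimum; it does not reproduce the analysis that generated the figures, and the names are hypothetical.

```python
# bytes per inode -> overhead (%), transcribed from the table above.
OVERHEAD = {
    1024: 12.5, 2048: 6.3, 3072: 4.5, 4096: 4.2, 5120: 4.4,
    6144: 4.9, 7168: 5.5, 8192: 6.3, 9216: 7.2, 10240: 8.3,
    11264: 9.5, 12288: 10.9, 13312: 12.7, 14336: 14.6, 15360: 16.7,
    16384: 19.1, 17408: 21.7, 18432: 24.4, 19456: 27.4, 20480: 30.5,
}


def best_ratio(table):
    """Bytes-per-inode ratio with the lowest overhead."""
    return min(table, key=table.get)


def penalty(table, ratio):
    """Extra overhead, in percentage points, of using ratio
    instead of the best ratio in the table."""
    return table[ratio] - table[best_ratio(table)]
```

Here best_ratio(OVERHEAD) is 4096, and penalty(OVERHEAD, 8192) is about 2.1 percentage points, matching the figures quoted above.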
The analysis technique I used is very sensitive to even a few file
systems with very small files.
The main source of file systems with lots of small files would appear
to be netnews servers. The typical Usenet message would appear to be
1-2k in length. Ignoring such file systems would drastically alter
the conclusions I reach. If, as I believe might already be the case,
news servers are manually tuned to have a lower than normal bytes per
inode ratio, it would then be possible to justify setting the default
ratio much higher.
Clearly it is best if the file system dynamically allocates inodes; I
believe AIX does this for instance. Systems that statically allocate
inodes should probably increase the bytes per inode ratio, but it is
not clear to exactly what value. The engineer in me says `it is
important to play this one conservatively: stick to 6k', the artist
goes `as Chris Torek says: aesthetics, 8k'.
Subject: [3] Papers, reports, and bibliographies
From: Papers, reports, and bibliographies
Network-available documents are listed in this section. I'd like to
see information for obtaining other sets of reports which aren't
electronically-available included here as well, at some stage.
Subject: [3.1] From where are papers for distributed systems available?
From: Papers, reports, and bibliographies
Amoeba
<URL:ftp://ftp.cs.vu.nl/amoeba>
<URL:http://www.cs.vu.nl/vakgroepen/cs/amoeba.html>
<URL:ftp://ftp.cse.ucsc.edu/pub/amoeba>
Arjuna
<URL:ftp://arjuna.ncl.ac.uk/pub/Arjuna>
Choices
<URL:ftp://choices.cs.uiuc.edu/>
Chorus
<URL:ftp://ftp.chorus.fr/pub/>
<URL:ftp://cse.ogi.edu/pub/chorus/reports>
Clouds
<URL:ftp://helios.cc.gatech.edu/pub/>
Cronus
<URL:ftp://pineapple.bbn.com/>
Mungi
<URL:http://i30www.ira.uka.de/projects/cosy/index.html>
ExOS
<URL:http://www.pdos.lcs.mit.edu/exo.html>
Flexmach
<URL:http://www.cs.utah.edu/projects/flexmach>
Fox
<URL:http://www.cs.cmu.edu/afs/cs.cmu.edu/project/fox/mosaic/HomePage.html>
Guide
<URL:ftp://ftp.imag.fr/pub/GUIDE/>
Horus
<URL:ftp://ftp.cs.cornell.edu/pub/Horus>
Isis
<URL:ftp://ftp.cse.ucsc.edu/pub/bib/isis.bib>
<URL:ftp://ftp.cs.cornell.edu/pub>
Mach
<URL:ftp://mach.cs.cmu.edu/>
<URL:http://www.cs.cmu.edu/afs/cs.cmu.edu/project/mach/public/www/mach.html>
<URL:http://riwww.osf.org:8001/os/index.html>
<URL:http://www.cs.utah.edu/projects/flexmach/mach4/html/Mach4-proj.html>
Nebula
<URL:http://www.sys.cse.psu.edu/NEBFS/nebula.html>
PEACE
<URL:http://www.gmd.de/FIRST/peace/peace.html>
Plan 9
<URL:ftp://plan9.att.com/plan9/>
<URL:http://www.ecf.toronto.edu/plan9>
<URL:http://plan9.att.com/plan9/plan9doc>
<URL:http://cooper.edu:9000/~rp/plan9/plan9-info.html>
RTmach
<URL:http://www.cs.cmu.edu:8001/afs/cs.cmu.edu/project/art-6/www/rtmach.html>
Spring
<URL:http://www.sun.com/technology-research/spring>
SUNMOS / Puma
<URL:http://www.cs.sandia.gov/~rolf/puma/puma.html>
Tigger
X kernel / Scout
<URL:ftp://cs.arizona.edu/pub/xkernel>
<URL:http://www.cs.arizona.edu/xkernel/www>
Papers covering Amoeba, Choices, Chorus, Clouds, the Hurd, Guide,
Mach, Mars, NonStop, and Plan 9 are also available via anonymous ftp
from <URL:ftp://ftp.funet.fi/pub/doc/OS>.
[I'd like to find the authoritative home for V. Mars and NonStop are
a bit more obscure, I think; they certainly aren't asked after much.]
Subject: [3.2] Where can I find other papers?
From: Papers, reports, and bibliographies
Angel
<URL:ftp://ftp.cs.city.ac.uk/>
Apertos
<URL:http://www.csl.sony.co.jp/project/Apertos>
Cache kernel
<URL:http://www-dsg.stanford.edu/papers/cachekernel/main.html>
Hive
<URL:http://www-flash.stanford.edu/OS>
Mungi
<URL:ftp://ftp.vast.unsw.edu.au/pub/>
KeyKOS
<URL:ftp://cs.dartmouth.edu/pub/sasos/papers/>
Pegasus
<URL:http://www.cl.cam.ac.uk/Research/SRG/pegasus.html>
QNX [93-09-19-22-22.26]
<URL:ftp://ftp.cse.ucsc.edu/pub/qnx>
<URL:ftp://ftp.qnx.com/pub/papers>
<URL:http://www.qnx.com>
Solaris 2.x [93-02-23-12-12.43]
<URL:ftp://opcom.sun.ca/pub/docs/>
SPIN
<URL:http://www.cs.washington.edu:80/research/projects/spin/www>
Synthetix
<URL:http://www.cse.ogi.edu/DISC/projects/synthetix>
VSTa
<URL:http://www.cen.uiuc.edu/~jeske/VSTa>
Windows NT [92-09-18-11-46.16]
<URL:ftp://ftp.uu.net/vendor/microsoft/win32-api>
<URL:ftp://ftp.uu.net/vendor/microsoft/isv-communications>
Subject: [3.3] Where can I find bibliographies?
From: Papers, reports, and bibliographies
Distributed shared memory
<URL:http://www.cs.uno.edu/~rasit/dsmbiblio.html>
Load balancing
<URL:ftp://ftp.cse.ucsc.edu/pub/bib/load-balancing.bib>
Mobile computing
<URL:ftp://ftp.comp.lancs.ac.uk/pub/mpg>
Multimedia operating systems [94-04-15-23-29.51]
<URL:ftp://cs.ucsd.edu/pub/>
<URL:ftp://ftp.cse.ucsc.edu/pub/bib/mmos.bib>
Object-oriented operating systems
<URL:ftp://ftp.cse.ucsc.edu/pub/bib/ooos.bib.Z>
<URL:ftp://ftp.inria.fr/INRIA/bib/>
Parallel and distributed I/O
<URL:ftp://ftp.cse.ucsc.edu/pub/bib/io.bib>
Sprite network operating system
<URL:ftp://ftp.cs.berkeley.edu/ucb/sprite/sprite.html>
See also section 4, General Internet-accessible resources.
[There's quite a lot more at <URL:ftp://ftp.cse.ucsc.edu/pub/bib>, if
anyone wants to add more to this list.]
Subject: [4] General Internet-accessible resources
From: General Internet-accessible resources
This section contains information about a variety of services
available to the OS research community via the Internet.
Subject: [4.1] Wide Area Information Service (WAIS) and World-Wide Web (WWW) servers
From: General Internet-accessible resources
[92-09-21-16-38.23] The Loughborough University high-performance
networking and distributed systems archive may be accessed via the
World Wide Web at <URL:http://hill.lut.ac.uk/DS-Archive>. According to
its organiser, Jon Knight <J.P.Knight@lut.ac.uk>, the archive
contains:
- Technical reports and papers written at LUT by the networks and
distributed systems researchers in the Department of Computer
Studies.
- Technical reports, papers and theses which have been produced at
other sites and then made available for public electronic access.
- Software which is of use in research or which has been produced by a
specific research project.
- Details of relevant conferences, collected from a variety of sources
(USENET, email, flyers, etc).
- Information on ongoing research projects.
- Bibliographies that have been generated for research at LUT and also
access to other WAIS indexed bibliographies, both at LUT and
elsewhere.
- A list of contacts in the field, with details of their research
interests. This is entirely voluntary (i.e. people have agreed to
Jon entering their details rather than him just rooting round the
Internet to build up the information).
Bibliographies in the comp.os.research collection are accessible via
WAIS from UCSC.
(:source
:version 3
:ip-address "128.114.134.19"
:ip-name "ftp.cse.ucsc.edu"
:tcp-port 210
:database-name "os-bibliographies"
:cost 0.00
:cost-unit :free
:maintainer "paul@cse.ucsc.edu"
:description "Server created with WAIS release 8 b5
on Jul 9 22:38:27 1992 by paul@cse.ucsc.edu
The files of type bibtex used in the index
were: /home/ftp/pub/bib"
)
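For anyone who wants to pull the host, port, and database name out of a
source description like the one above, here is a minimal sketch in
Python. It is not the parser used by any WAIS client; it simply assumes
the layout shown here (a leading "(:source", then ":key value" pairs
whose values are either double-quoted strings or bare tokens), and the
function name is hypothetical.

```python
import re

# One ":key value" pair; a value is a double-quoted string (which may
# span several lines) or a bare token such as 3, 0.00, or :free.
FIELD = re.compile(r':([\w-]+)\s+("(?:[^"\\]|\\.)*"|[^\s()]+)')


def parse_wais_source(text):
    """Return the fields of a WAIS source description as a dict of strings."""
    body = text.strip()
    if not body.startswith("(:source") or not body.endswith(")"):
        raise ValueError("not a WAIS source description")
    body = body[len("(:source"):-1]  # drop the wrapper, keep the fields
    fields = {}
    for key, raw in FIELD.findall(body):
        if raw.startswith('"'):
            raw = raw[1:-1]          # strip the surrounding quotes
        elif raw.startswith(":"):
            raw = raw[1:]            # symbol value such as :free
        fields[key] = raw
    return fields


SAMPLE = '''(:source
:version 3
:ip-address "128.114.134.19"
:ip-name "ftp.cse.ucsc.edu"
:tcp-port 210
:database-name "os-bibliographies"
:cost 0.00
:cost-unit :free
:maintainer "paul@cse.ucsc.edu"
:description "Server created with WAIS release 8 b5
on Jul 9 22:38:27 1992 by paul@cse.ucsc.edu
The files of type bibtex used in the index
were: /home/ftp/pub/bib"
)'''

if __name__ == "__main__":
    src = parse_wais_source(SAMPLE)
    print(src["ip-name"], int(src["tcp-port"]), src["database-name"])
```

All values are left as strings, so convert numeric fields such as
tcp-port yourself when you need them.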
Subject: [4.2] Refdbms---a distributed bibliographic database system
From: General Internet-accessible resources
[92-10-01-11-39.32] The 13th alpha release of refdbms version 3,
developed by John Wilkes of the Concurrent Systems Project at
Hewlett-Packard Laboratories and Richard Golding of the Concurrent
Systems Laboratory at UC Santa Cruz, is now available. It can be
obtained by anonymous ftp from <URL:ftp://ftp.cse.ucsc.edu/pub/refdbms>.
The system has been tested on Sun 3 and 4 systems running SunOS 4.1.x,
and on DECstations running Ultrix 4.1. It is an experiment in
building weak-consistency wide-area distributed applications, and the
databases currently available for it cover the systems literature
well.
The system includes tools to query the database, to produce
bibliographies for LaTeX documents, and to enter new references into
the database. It is part of ongoing research into wide-area
distributed information systems on the Internet.
Features include:
- Distributed databases: a reference database can be shared among
multiple sites. Updates can be entered at any site, and will be
propagated to the other sites holding a replica of the database.
- Multiple databases: every database has a name, and users specify the
order in which databases will be searched.
- Private databases: databases can be private, available site-wide, or
they can be made available to other sites.
- Database query by keyword, author, and title word.
- Translator for refer-format databases.
- Usable with LaTeX documents: the internal refdbms format can be
translated into a special BibTeX format.
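The idea of searching several named databases in a user-chosen order
can be sketched in a few lines of Python. This is emphatically not
refdbms code, nor its real interface or storage format: the dictionary
layout, the function names, and the placeholder entries are all
assumptions made up for illustration.

```python
# Hypothetical sketch: query several named reference databases in a
# user-specified order by author, keyword, or title word.

def matches(ref, author=None, keyword=None, title_word=None):
    """True if ref satisfies every criterion that was supplied."""
    if author is not None and not any(
            author.lower() in a.lower() for a in ref.get("authors", [])):
        return False
    if keyword is not None and keyword.lower() not in [
            k.lower() for k in ref.get("keywords", [])]:
        return False
    if title_word is not None and title_word.lower() not in (
            ref.get("title", "").lower().split()):
        return False
    return True


def search(databases, order, **criteria):
    """Yield (database name, reference) pairs, honouring the search order."""
    for name in order:
        for ref in databases.get(name, []):
            if matches(ref, **criteria):
                yield name, ref


if __name__ == "__main__":
    # Placeholder entries, not real references.
    databases = {
        "private": [{"title": "Placeholder entry one",
                     "authors": ["A. Author"],
                     "keywords": ["replication"]}],
        "site": [{"title": "Placeholder entry two",
                  "authors": ["B. Writer"],
                  "keywords": ["replication"]}],
    }
    for name, ref in search(databases, ["private", "site"],
                            keyword="replication"):
        print(name, ref["title"])
```

Listing the private database first means a user's own entries turn up
ahead of anything found in the shared ones.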
An up-to-date list of bibliographies exported by various institutions
may be obtained using anonymous ftp from
<URL:ftp://ftp.cse.ucsc.edu/pub/refdbms/current-databases>.
Subject: [4.3] Willow -- the information looker-upper
From: General Internet-accessible resources
The University of Washington's Willow system provides a Motif-based
user interface to a heterogeneous collection of on-line bibliographic
databases. It will compile and run on most systems which provide a
Motif library.
For further information, see the Willow home page at
<URL:http://www.cac.washington.edu/willow/home.html>.
Subject: [4.4] Computer science bibliographies and technical reports
From: General Internet-accessible resources
- A collection of bibliographies in various fields of computer science
is available via anonymous ftp and the World Wide Web. The
bibliographies contain about 260,000 references, most of which are
references to journal articles, conference papers or technical
reports. The collection has been formed by using various freely
accessible services in the Internet (anonymous ftp, mailserver,
wais, telnet) and converting each bibliography into a uniform BibTeX
format. It is organised in files containing references to a (more
or less) specific area within computer science.
The database has been organised by Alf-Christian Achilles
<achilles@ira.uka.de>. It may be accessed on the Web at
<URL:http://liinwww.ira.uka.de/bibliography/index.html>, via ftp from
<URL:ftp://ftp.cs.umanitoba.ca/pub/bibliographies>, and through a
more useful search mechanism on the Web at
<URL:http://glimpse.cs.arizona.edu/1994/bib>.
- As part of the ARPA Electronic Library Project, the Database Group
at Stanford is providing a Selective Dissemination of Information
(SDI) service to disseminate information about computer science
technical reports. You can have a server email you periodic
announcements of new papers on topics that interest you.
See <URL:http://cs-tr.cs.cornell.edu/Info/cstr.html> for details, or
contact Tak Yan <tyan@cs.stanford.edu> or the mail server itself
at <elib@db.stanford.edu>.
Subject: [4.5] The comp.os.research archive
From: General Internet-accessible resources
[93-02-18-21-18.31] An archive of all messages posted to
comp.os.research since 1988 is maintained at UC Santa Cruz. It may be
accessed via anonymous ftp at
<URL:ftp://ftp.cse.ucsc.edu/pub/comp.os.research>. The archive is
organised by year.
Postings may also be found via WAIS at UCSC's Computer Science gopher
hole:
(:source
:version 3
:ip-address "128.114.134.19"
:ip-name "ftp.cse.ucsc.edu"
:tcp-port 210
:database-name "comp-os-research"
:cost 0.00
:cost-unit :free
:maintainer "paul@cse.ucsc.edu"
:description "Server created with WAIS release 8 b5
on Jul 9 03:51:11 1992 by paul@cse.ucsc.edu
The files of type netnews used in the index
were: /home/ftp/pub/comp.os.research"
)
Subject: [4.6] Miscellaneous resources
From: General Internet-accessible resources
- Paul Harrington <phrrngtn@dcs.st-andrews.ac.uk> maintains a World
Wide Web page on checkpointing, at
<URL:http://warp.dcs.st-and.ac.uk/warp/systems/checkpoint>.
- Jay Lepreau <lepreau@cs.utah.edu> has made available an
electronic version of the proceedings of OSDI '94 at
<URL:http://www.cs.utah.edu/~lepreau/osdi94>. Available are such
things as
- Papers: abstracts, papers, slides, bibtex entries,
and for most, the actual software.
- Keynote: audio and slides
- Extensible OS panel: audio, slides, project URLs
- Insularity panel: audio
- Mach/Chorus workshop: TRs for most, slides, some software
- Tutorials: slides for half, descriptions for all
- Miscellaneous: summary report from ;login, list of works-in-progress
talks, hard-copy proceedings ordering info, CFP,
proceedings introduction, list of referees.
Subject: [5] Disclaimer and copyright
From: Disclaimer and copyright
Note that this document is provided as is. The information in it is
not warranted to be correct; you use it at your own risk.
Following recent reports on the <faq-maintainers@mit.edu> list I
think it wise to change the copyright:
NOTICE OF COPYRIGHT AND PERMISSIONS
Answers to Frequently Asked Questions for comp.os.research (hereafter
referred to as These Articles) are Copyright (C) 1993, 1994, 1995, and 1996
by Bryan O'Sullivan <bos@serpentine.com>. They may be reproduced and
distributed in whole or in part, subject to the following conditions:
- This copyright and permission notice must be retained on all
complete or partial copies of These Articles.
- These Articles may be copied or distributed in part or in full for
personal or educational use. Any translation, derivative work, or
copies made for other purposes must be approved by the copyright
holder before distribution, unless otherwise stated.
- If you distribute These Articles, instructions for obtaining the
complete current versions of them free or at cost price must be
included. Redistributors must make reasonable efforts to maintain
current copies of These Articles.
Exceptions to these rules may be granted, and I shall be happy to
answer any questions about this copyright notice -- write to Bryan
O'Sullivan, PO Box 62215, Sunnyvale, CA 94088-2215, USA or email
<bos@serpentine.com>. These restrictions are here to protect the
contributors, not to restrict you as educators and learners.