4.2. Arrays and Hashes

Now we've looked at the most common types of scalar, (there are a few complications, which we'll cover in Section 4.3) let's examine array and hash structures. These, too, are build on top of the basic SV structure, with reference counts and flags, and structures hung off sv_any.

4.2.1. Arrays

Arrays are known in the core as AVs. Their structure can be found in av.h:

struct xpvav {
    char*   xav_array;  /* pointer to first array element */
    SSize_t xav_fill;   /* Index of last element present */
    SSize_t xav_max;    /* max index for which array has space */
    IV      xof_off;    /* ptr is incremented by offset */
    NV      xnv_nv;     /* numeric value, if any */
    MAGIC*  xmg_magic;  /* magic for scalar array */
    HV*     xmg_stash;  /* class package */
    SV**    xav_alloc;  /* pointer to malloced string */
    SV*     xav_arylen;
    U8      xav_flags;
};

We're going to skip over xmg_magic and xmg_stash for now, and come back to them in Section 4.3.

Let's use Devel::Peek as before to examine the AV, but we must remember that we can only give one argument to Devel::Peek::Dump; hence, we must pass it a reference to the AV:

% perl -MDevel::Peek -e '@a=(1,2,3); Dump(\@a)'                                                          (1)
SV = RV(0x8106ce8) at 0x80fb380      (1)
  REFCNT = 1
  FLAGS = (TEMP,ROK)
  RV = 0x8105824
  SV = PVAV(0x8106cb4) at 0x8105824  (2)
    REFCNT = 2
    FLAGS = ()
    IV = 0
    NV = 0
    ARRAY = 0x80f7de8                (3)
    FILL = 2                         (4)
    MAX = 3                          (5)
    ARYLEN = 0x0                     (6)
    FLAGS = (REAL)                   (7)
    Elt No. 0
    SV = IV(0x80fc1f4) at 0x80f1460  (8)
      REFCNT = 1
      FLAGS = (IOK,pIOK,IsUV)
      UV = 1
    Elt No. 1
    SV = IV(0x80fc1f8) at 0x80f1574
      REFCNT = 1
      FLAGS = (IOK,pIOK,IsUV)
      UV = 2
    Elt No. 2
    SV = IV(0x80fc1fc) at 0x80f1370
      REFCNT = 1
      FLAGS = (IOK,pIOK,IsUV)
      UV = 3
(1)
We're dumping the reference to the array, which is, as you would expect, an RV.
(2)
The RV contains a pointer to another SV: this is our array; the Dump function helpfully calls itself recursively on the pointer.
(3)
The AV contains a pointer to a C array of SVs. Just like a string, this array must be able to change its size; in fact, the expansion and contaction of AVs is just the same as that of strings.
(4)
To facilitate this, FILL is the highest index in the array. This is usually equivalent to $#array.
(5)
MAX is the maximum allocated size of the array; if FILL has to become more than MAX, the array is grown with av_extend.
(6)
We said that FILL was usually equivalent to $#array, but the exact equivalent is ARYLEN. This is an SV that is created on demand - that is, whenever $#array is read. Since we haven't read $#array in our example, it's currently a null pointer. The distinction between FILL and $#array is important when an array is tied.
(7)
The REAL flag is set on "real" arrays; these are arrays which reference count their contents. Arrays such as @_ and the scratchpad arrays (see below) are fake, and do not bother reference counting their contents as an efficiency hack.
(8)
Devel::Peek::Dump shows us some of the elements of the array; these are ordinary SVs.

Something similar to the offset hack is performed on AVs to enable efficient shifting and splicing off the beginning of the array; while AvARRAY (xav_array in the structure) points to the first element in the array that is visible from Perl, AvALLOC (xav_alloc) points to the real start of the C array. These are usually the same, but a shift operation can be carried out by increasing AvARRAY by one and decreasing AvFILL and AvLEN. Again, the location of the real start of the C array only comes into play when freeing the array. See av_shift in av.c.

4.2.2. Hashes

Hashes are represented in the core as, you guessed it, HVs. Before we look at how this is implemented, we'll first see what a hash actually is...

4.2.2.1. What is a "hash" anyway?

A hash is actually quite a clever data structure: it's a combination of an array and a linked list. Here's how it works:

  1. The hash key undergoes a transformation to turn it into a number called, confusingly, the hash value. For Perl, the C code that does the transformation looks like this: (from hv.h)

        register const char *s_PeRlHaSh = str;
        register I32 i_PeRlHaSh = len; 
        register U32 hash_PeRlHaSh = 0; 
        while (i_PeRlHaSh--) {
            hash_PeRlHaSh += *s_PeRlHaSh++;
            hash_PeRlHaSh += (hash_PeRlHaSh << 10);
            hash_PeRlHaSh ^= (hash_PeRlHaSh >> 6);
        }
        hash_PeRlHaSh += (hash_PeRlHaSh << 3);
        hash_PeRlHaSh ^= (hash_PeRlHaSh >> 11);
        (hash) = (hash_PeRlHaSh += (hash_PeRlHaSh << 15));
    		
    Converting that to Perl and tidying it up:
        sub hash {
            my $string = shift;
            my $hash;
            for (map {ord $_} split //, $string) {
                 $hash += $_; $hash += $hash << 10; $hash ^= $hash >> 6;
            }
            $hash += $hash << 3; $hash ^= $hash >> 1;
            return ($hash + $hash << 15);
        }
    		

  2. This hash is distributed across an array using the modulo operator. For instance, if our array has 8 elements, ("Hash buckets") we'll use $hash_array[$hash % 8]

  3. Each bucket contains a linked list; adding a new entry to the hash appends an element to the linked list. So, for instance, $hash{"red"}="rouge" is implemented similar to

        push @{$hash->[hash("red") % 8]},
            { key   => "red", 
              value => "rouge", 
              hash  => hash("red") 
            };

    Why do we store the key as well as the hash value in the linked list? The hashing function may not be perfect - that is to say, it may generate the same value for "red" as it would for, say, "blue". This is called a hash collision, and, while it is rare in practice, it explains why we can't depend on the hash value alone.

As usual, a picture speaks a thousand words:

4.2.2.2. Hash Entries

Hashes come in two parts: the HV is the actual array containing the linked lists, and is very similar to an AV; the things that make up the linked lists are hash entry structures, or HEs. From hv.h:

/* entry in hash value chain */
struct he {
    HE      *hent_next; /* next entry in chain */
    HEK     *hent_hek;  /* hash key */
    SV      *hent_val;  /* scalar value that was hashed */
};

/* hash key -- defined separately for use as shared pointer */
struct hek {
    U32     hek_hash;   /* hash of key */
    I32     hek_len;    /* length of hash key */
    char    hek_key[1]; /* variable-length hash key */
};

As you can see from the above, we simplified slightly when we put the hash key in the buckets above: the key and the hash value are stored in a separate structure, a HEK.

The HEK stored inside a hash entry represents the key: it contains the hash value and the key itself. It's stored separately so that Perl can share identical keys between different hashes - this saves memory and also saves time calcu.llating the hash value. You can use the macros HeHASH(he) and HeKEY(he) to retrieve the hash value and the key from a HE.

4.2.2.3. Hash arrays

Now to turn to the HVs themselves, the arrays which hold the linked lists of HEs. As we mentioned, these are not too dissimilar from AVs.

% perl -MDevel::Peek -e '%a = (red => "rouge", blue => "bleu"); Dump(\%a);'
SV = RV(0x8106c80) at 0x80f1370                            (1)
  REFCNT = 1
  FLAGS = (TEMP,ROK)
  RV = 0x81057a0
  SV = PVHV(0x8108328) at 0x81057a0   
    REFCNT = 2
    FLAGS = (SHAREKEYS)                                    (2)
    IV = 2
    NV = 0
    ARRAY = 0x80f7748  (0:6, 1:2)                          (3)
    hash quality = 150.0%                                  (4)
    KEYS = 2                                               (5)
    FILL = 2                           
    MAX = 7                                                (3)
    RITER = -1                                             (6)
    EITER = 0x0                                            (6)
    Elt "blue" HASH = 0x8a5573ea                           (7)
    SV = PV(0x80f17b0) at 0x80f1574
      REFCNT = 1
      FLAGS = (POK,pPOK)
      PV = 0x80f5288 "bleu"\0
      CUR = 4
      LEN = 5
    Elt "red" HASH = 0x201ed
    SV = PV(0x80f172c) at 0x80f1460
      REFCNT = 1
      FLAGS = (POK,pPOK)
      PV = 0x80ff370 "rouge"\0
      CUR = 5
      LEN = 6
(1)
As before, we dump a reference to the AV, since Dump only takes one parameter.
(2)
The SHAREKEYS flag means that the key structures, the HEKs can be shared between hashes to save memory. For instance, if we have $french{red} = "rouge"; $german{red} = "rot", the key structure is only created once, and both hashes contain a pointer to it.
(3)
As we mentioned before, there are eight buckets in our hash initially - the hash gets restructured as needed. The numbers in brackets around ARRAY tell us about the population of those buckets: six of them have no entries, and two of them have one entry each.
(4)
The "quality" of a hash is related to how long it takes to find an element, and this is in turn related to the average length of the hash chains, the linked lists attached to the buckets: if there is only one element in each bucket, you can find the key simply by performing the hash function. If, on the other hand, all the elements are in the same hash bucket, the hash is particularly inefficient.
(5)
HvKEYS(hv) returns the number of keys in the hash - in this case, two.
(6)
These two values refer to the hash iterator: when you use, for instance, keys or each to iterate over a hash, Perl uses these values to keep track of the current entry. The "root iterator", RITER, is the array index of the bucket currently being iterated, and the "entry interator", EITER, is the current entry in the hash chain. EITER walks along the hash chain, and when it gets to the end, it increments RITER and looks at the first entry in the next bucket. As we're currently not in the middle of a hash iteration, these are set to "safe" values.
(7)
As with an array, the Dump function shows us some of the elements; it also shows us the hash key: the key for "blue" is 0x3954c8. (You can confirm that this is correct by running hash("blue") using the Perl subroutine given above.)