HowTosAndTutorials/CompilationFuTutorial

Introduction

Questions about compiling, installing, and configuring Ruby, Ruby extensions, and third-party dependencies come up quite often on the various Ruby mailing lists and forums. Although this tutorial is not specifically about doing science with Ruby, it's a nod to the fact that doing any significant amount of scientific programming with Ruby will almost certainly require you to install Ruby extensions and, as we shall soon see, may compel you to compile Ruby itself.

If you already understand the nuances of this:

  export prefix=/usr/local/
  export LD_LIBRARY_PATH=$prefix/lib
  export LD_RUN_PATH=$prefix/lib        # this one's important!
  tar xvfz ruby-1.8.4.tar.gz            # x = extract (c would create an archive)
  cd ruby-1.8.4
  ./configure --prefix=$prefix --enable-shared && make && sudo make install

then stop reading. Otherwise, read on, as we sketch out all the particulars you'll need to understand to be a compilation master on any platform.

Why?

You may be asking yourself why you should compile Ruby at all. After all, your package manager may make installing Ruby as easy as

  #
  # installing ruby on rh fedora linux
  #
    yum install ruby

or there may even be a nice one-click installer for your OS. Why not use one and just get going? We aren't here to bad-mouth package management systems, but we'll enumerate a few reasons we think you'll want to compile Ruby yourself:

  #
  # dump the config
  #
    ruby -r yaml -r rbconfig -e'  y Config::CONFIG  '
  
  #
  # show which compiler was used to build ruby
  #
    ruby -r yaml -r rbconfig -e'  y Config::CONFIG["CC"]  '
  
  #
  # show which arch ruby was built for
  #
    ruby -r yaml -r rbconfig -e'  y Config::CONFIG["arch"]  '

What?

In order to streamline this tutorial we'll concern ourselves only with POSIX-like (http://en.wikipedia.org/wiki/POSIX/) operating systems with a reasonable subset of GNU-like (http://gnu.org/) tools. If you don't want to bother reading the links or don't care what POSIX and GNU are, suffice it to say this covers nearly every modern computing environment, including Mac and Linux, with the notable exception of MS Windows systems. However, nearly all the following material will apply even there if one simply installs the MSYS (http://www.mingw.org/) compiler tool chain for Windows. This is by far the easiest solution to setting up an extensible Ruby on MS Windows, as it gives you a "minimal system" in which to configure, compile, and install software.

At this point we're going to take a step back in order to describe a few of the central tools and concepts you'll need to understand if you are to become a compilation guru.

The Compiler

There are essentially two ways to execute computer programs: interpreting or compiling. Interpreters for languages like Ruby are simply compiled programs themselves that read the text of a program and execute its instructions - in short, the text reads "jump" and the interpreter executes "jump." A compiled program, by contrast, is a binary instruction set specific to a particular machine's hardware. These kinds of programs are typically created by writing in a language like C and then using a "compiler" to transform the program's text into machine-specific binary. In fact, this process often takes the intermediate step of first transforming the relatively higher-level language into "assembler", the lowest human-readable form of computer programming, and then transforming the resulting assembly code into binary. We can sketch out the entire process with a short example:

Say we have a very simple C program:

  ~ > cat a.c
  #include <stdio.h>
  int main(void) { printf("hello world\n"); return 0; }

We can generate the assembler for this program using

  ~ > gcc -S a.c
  ~ > cat a.s

          .file   "a.c"
          .section        .rodata
  .LC0:
          .string "hello world\n"
          .text
  .globl main
          .type   main, @function
  main:
          pushl   %ebp
          movl    %esp, %ebp
          subl    $8, %esp
          andl    $-16, %esp
          movl    $0, %eax
          addl    $15, %eax
          addl    $15, %eax
          shrl    $4, %eax
          sall    $4, %eax
          subl    %eax, %esp
          subl    $12, %esp
          pushl   $.LC0
          call    printf
          addl    $16, %esp
          leave
          ret
          .size   main, .-main
          .section        .note.GNU-stack,"",@progbits
          .ident  "GCC: (GNU) 3.4.4 20050721 (Red Hat 3.4.4-2)"

This doesn't look too much like our tiny C program, but without knowing a thing about assembler we can say that it's certainly possible that this program prints out "hello world." We can also say that we're glad we don't write in assembler very often any more! ;-)

This assembler code can then be transformed into a binary program using

  ~ > gcc a.s

and then run.

  ~ > a.out
  hello world

Note that on some systems the resulting binary may be called "a.exe" or similar, and that if the current directory is not in your PATH you will need to type ./a.out instead.

The compiler takes about one billion options, none of which we are going to get into in this tutorial. The only important points to take away here are that compilers turn a program's text into assembler, and then turn that assembler into machine-specific binary code. In order to compile Ruby and other packages this is pretty much all you need to know.

The Linker

The most important beast in the compilation lifecycle is the linker. The linker is responsible for putting pieces of programs together to make a whole program or library. This is most easily illustrated with an example. Say we have this library

  ~ > cat liba.c
  #include <stdio.h>
  void say_hello(void) { printf("hello world\n"); }

and this main program

  ~ > cat a.c
  void say_hello(void);
  int main(void) { say_hello(); return 0; }

then they can be compiled and put together like so

  #
  # build the object file liba.o
  #
    ~ > gcc -c liba.c 

  #
  # 'link' it together with a.c
  #
    ~ > gcc a.c liba.o

  #
  # run the resulting binary program
  #
    ~ > a.out
    hello world

Let's review that. We defined a routine, "say_hello", in one file and used it in another. The way we did this is by compiling liba.c into what's known as an object file. An object file is binary output just like everything else from the compiler. The crucial difference is that there is no "main", or top-level routine, and that we've used the '-c' switch to tell the compiler to compile without linking. Thus, rather than being an executable program, we've simply created an intermediate binary package of routines that can be re-used by combining them with other binary files. If this were all there were to linking we could stop there. However, there is a problem with this kind of "static" linking: each program that "links" in routines in this way gets its own copy of the routine. That is to say, if 42 programs all use the say_hello() routine, each of them will carry its own copy of that code, on disk and in memory.

Enter shared libraries. Shared libraries work like this:

  #
  # build the shared object file liba.so
  # (-fPIC requests position-independent code, which shared libraries need on most platforms)
  #
    ~ > gcc -shared -fPIC -o liba.so liba.c

  #
  # 'link' it together with a.c
  #
    ~ > gcc a.c liba.so

  #
  # run the resulting binary program
  #
    ~ > a.out
    a.out: error while loading shared libraries: liba.so: cannot open shared object file: No such file or directory

Oops. That didn't work, did it? Here's the first thing you need to know about shared libraries:

  1. They are found at run time

Remember that shared means just that: all programs share one copy of the library, and a copy may already be in memory when the program is run. Because of this, loading is deferred until run time, when the dynamic linker can fulfill the outstanding requirements of every running program for a particular library by loading that library only once. This has several implications:

  1. A mechanism is required for the linker to search for libraries. This must work recursively since libraries may require libraries which may require further libraries, and so on.

  2. Binaries built with shared libraries will be much smaller than binaries built with static ones. This may not seem like a big deal unless you consider that every program running on your computer probably has libc.so linked into it!

  3. Binaries built with shared libraries might behave differently from one invocation of the program to the next. That is to say, changing a library may affect many programs since, the next time they run, they may well pick up the new copy of the library. A versioning system is required so this kind of behaviour can be controlled: sometimes we want to pick up changes, such as in the case of a bug fix, and sometimes we do not, such as in the case of a major code change that changes the parameter order of a function call or makes some other interface change.

Let's go over each of these. Before doing that we'll mention that most of what we'll say about shared libraries applies to DLLs on Windows too; after all, they are the same thing. The biggest difference is simply how the operating system manages them.

How does the linker find libraries? The definitive answer can be found by running

  #
  # always rtfm.  carefully.  all of it.
  #
    ~ > man ld.so
    ~ > man ldd
    ~ > man ldconfig

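on a Linux-like system. Basically, libraries are looked for in any run path recorded in the program itself, in the directories listed in the LD_LIBRARY_PATH environment variable, in the cache maintained by ldconfig (/etc/ld.so.cache), and finally in default directories such as /lib and /usr/lib.

This ought to give us a hint as to why our program was failing: liba.so is in none of those places, so it wasn't being found! Here's a fix: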

  #
  # set LD_LIBRARY_PATH to the current directory and run a.out
  #
    ~ > LD_LIBRARY_PATH=`pwd` a.out
    hello world

Here we just told the linker, via the environment variable LD_LIBRARY_PATH, to include the current directory in its search for whatever library a.out was looking for. It's worth pointing out that, here, we actually knew which library the program needed. Sometimes that's not so easy. Fortunately we have a command which does that for us:

  #
  # show which libraries a.out requires, and where they would be found if we ran the program
  #
    ~ > ldd a.out
          linux-gate.so.1 =>  (0xb7f58000)
          liba.so => not found
          libc.so.6 => /lib/tls/libc.so.6 (0x0069b000)
          /lib/ld-linux.so.2 (0x00681000)

Well. That's pretty obvious output, isn't it? Here's another run, this time with LD_LIBRARY_PATH set:

  #
  # show which libraries a.out requires, and where they would be found if we ran the program
  #
    ~ > LD_LIBRARY_PATH=`pwd` ldd a.out
          linux-gate.so.1 =>  (0xb7fa8000)
          liba.so => /home/ahoward/sciruby/liba.so (0xb7fa5000)
          libc.so.6 => /lib/tls/libc.so.6 (0x0069b000)
          /lib/ld-linux.so.2 (0x00681000)

Hopefully the function of LD_LIBRARY_PATH is clear: it's like PATH, but for shared libraries instead of executables.
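
To make the PATH analogy concrete, here is a minimal sketch of a first-match-wins search over a colon-separated list of directories. The find_lib function is a hypothetical helper written for this illustration, not a real tool, and it models only the LD_LIBRARY_PATH step; the real dynamic loader consults other locations as well (see man ld.so).

```shell
#!/bin/sh
# Sketch of a PATH-style lookup: walk a colon-separated list of directories
# and print the first one that contains the requested library file.
# find_lib is a hypothetical helper for illustration only.
find_lib() {
    lib=$1
    rest=$2
    while [ -n "$rest" ]; do
        dir=${rest%%:*}              # first entry in the list
        case $rest in
            *:*) rest=${rest#*:} ;;  # drop that entry and its colon
            *)   rest= ;;            # that was the last entry
        esac
        # The real loader treats an empty entry as the current directory;
        # this sketch simply skips it.
        if [ -n "$dir" ] && [ -f "$dir/$lib" ]; then
            printf '%s\n' "$dir/$lib"
            return 0
        fi
    done
    return 1
}

# Usage: report where liba.so would be found via LD_LIBRARY_PATH alone.
find_lib liba.so "${LD_LIBRARY_PATH:-}" || echo "liba.so => not found"
```

Just as with PATH, the order of the entries matters: an earlier directory shadows a later one that holds a library of the same name.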

Now this is important: if you compile code against libraries that are not in the normal places the linker looks for libraries, you will need to have LD_LIBRARY_PATH set when the code is run! Many programs, like Firefox, accomplish this by making users run the program through a shell script:

  #
  # show what the firefox program is
  #
    ~ > file `which firefox`
    /usr/bin/firefox: Bourne shell script text executable
  #
  # see what it's doing with LD_LIBRARY_PATH
  #
    ~ > grep LD_LIBRARY_PATH `which firefox`
    ## Set LD_LIBRARY_PATH
    if [ "$LD_LIBRARY_PATH" ]
      LD_LIBRARY_PATH=$MOZ_DIST_BIN:$MOZ_DIST_BIN/plugins:$LD_LIBRARY_PATH
      LD_LIBRARY_PATH=$MOZ_DIST_BIN:$MOZ_DIST_BIN/plugins
    export LD_LIBRARY_PATH

Personally, I think that's kind of weak. It would be nice if things just worked, wouldn't it? They can:

  #
  # compile with LD_RUN_PATH set
  #
    ~ > LD_RUN_PATH=`pwd` gcc a.c liba.so
  #
  # see where ldd thinks it'll find liba.so when a.out is run
  #
    ~ > ldd a.out
            linux-gate.so.1 =>  (0xb7f6c000)
            liba.so => /home/ahoward/sciruby/liba.so (0xb7f69000)
            libc.so.6 => /lib/tls/libc.so.6 (0x0069b000)
            /lib/ld-linux.so.2 (0x00681000)
  #
  # note the code runs, even without LD_LIBRARY_PATH set!
  #
    ~ > a.out
    hello world

Voilà. This is what LD_RUN_PATH does. It's similar to LD_LIBRARY_PATH, only the linker applies it at link time, not run time. How does it do this? It simply encodes the location of the library used to build the program into the program itself for later reference. Basically it stores a hint for itself. On many systems it really is just a hint, too: if a user sets LD_LIBRARY_PATH it will override any setting created via LD_RUN_PATH, as it well should. (Whether it does depends on how the linker recorded the path: a DT_RUNPATH entry is searched after LD_LIBRARY_PATH, an older-style DT_RPATH entry before it; see man ld.so.) The beauty of LD_RUN_PATH really comes into play when a big, complicated compile links in 20 libraries, each of which is installed in a non-standard location - with LD_RUN_PATH the locations can be encoded into the binary and all dependencies will automatically be found.

As mentioned above, shared libraries result in much smaller binaries. You may have noticed that the a.out program has shared library dependencies whether or not it was explicitly compiled against any shared libs:

  #      
  # build a.out using a shared lib      
  #      
    ~ > LD_RUN_PATH=`pwd` gcc a.c liba.so
  #
  # view its shared library dependencies
  #
    ~ > ldd a.out
            linux-gate.so.1 =>  (0xb7f38000)
            liba.so => /home/ahoward/sciruby/liba.so (0xb7f35000)
            libc.so.6 => /lib/tls/libc.so.6 (0x0069b000)
            /lib/ld-linux.so.2 (0x00681000)
  #
  # build a.out using a static lib
  #
    ~ > gcc a.c liba.o
  #
  # view its shared library dependencies
  #
    ~ > ldd a.out
            linux-gate.so.1 =>  (0xb7f94000)
            libc.so.6 => /lib/tls/libc.so.6 (0x0069b000)
            /lib/ld-linux.so.2 (0x00681000)

Shared libraries are such an important feature of modern operating systems that there is no way around them: they are the default for nearly all system functionality. Imagine if every program on your computer had its own copies of linux-gate.so.1, libc.so.6, and /lib/ld-linux.so.2! Every program on your computer is going to have at least a few shared library dependencies like this. Incidentally, a really great way to hork your entire system is to introduce a bug into libc.so - you can see why!

Note the version strings in the library names above. I'm not going to get too deep into how the linker uses versioning to keep things sane; that's already spelled out in detail here

suffice it to say the concept deals with interfaces and implementations. When you link against code that supports a certain interface, for example a particular function signature, you don't want your code to stop running when that shared library gets updated. The linker and shared library versioning solve this by supporting the concept of linking against some library that will support the functionality we desire. How does it do this? It looks for a library with the same name as the one we linked against, but it uses the newest one that supports the interface we required at compile time - the implementation is allowed to change; that's how bug fixes get seamlessly integrated into running systems when shared libraries are used. What we do not want is for the linker to pull in a library with a completely different interface! The links above explain quite well how this is done in practice, but here's an extremely simplified example. Say you compiled a program using:


  fortytwo :~/sciruby > cat liba.c
  #include <stdio.h>
  void say_hello(void) { printf("hello world\n"); }
  fortytwo :~/sciruby > gcc -shared -fPIC liba.c -o liba.so.1.0
  fortytwo :~/sciruby > LD_RUN_PATH=`pwd` gcc a.c liba.so.1.0
  fortytwo :~/sciruby > ldd a.out
          linux-gate.so.1 =>  (0xb7f7d000)
          liba.so.1.0 => /home/ahoward/sciruby/liba.so.1.0 (0xb7f7a000)
          libc.so.6 => /lib/tls/libc.so.6 (0x0069b000)
          /lib/ld-linux.so.2 (0x00681000)
  fortytwo :~/sciruby > a.out
  hello world


The Shell

... in progress

Configure

... in progress

Make

... in progress

Links

