Introduction
Questions about compiling, installing, and configuring Ruby, Ruby extensions, and third party dependencies come up quite often on the various Ruby mailing lists and forums. Although this tutorial is not specifically about doing science with Ruby, it's a nod to the fact that doing any significant amount of scientific programming with Ruby will almost certainly require you to install Ruby extensions and, as we shall soon see, may compel you to compile Ruby itself.
If you already understand the nuances of this:
export prefix=/usr/local
export LD_LIBRARY_PATH=$prefix/lib
export LD_RUN_PATH=$prefix/lib # this one's important!
tar xvfz ruby-1.8.4.tar.gz
cd ruby-1.8.4
./configure --prefix=$prefix --enable-shared && make && sudo make install
then stop reading. Otherwise, read on, as we sketch out all the particulars you'll need to understand to be a compilation master on any platform.
Why?
You may be asking yourself why you should compile Ruby at all. After all, your package manager may make installing Ruby as easy as
#
# installing ruby on rh fedora linux
#
yum install ruby
or there may even be a nice one-click installer for your OS. Why not use one and just get going? We aren't here to bad-mouth package management systems, but we'll enumerate a few reasons we think you'll want to compile Ruby yourself:
Many (most?) package managers have broken ruby. If you search the ruby-talk archives you'll see something along these lines comes up once per month or so. We don't like broken Rubys.
Even if your package manager manages to grab a decent working binary compiled on some yahoo's virtual machine running whatever hodgepodge set of utils he happened to be using that day, there's no guarantee that the compile options leverage your particular architecture or environment.
No package manager will install a Ruby optimized for your system. Ruby isn't the speediest language out there - no point in making it any slower.
Package managers install a version of Ruby compiled against the libraries of the system the installer was created on: if the kind soul that maintains the Ruby rpm or one-click installer had, for example, a broken version of the DBM, OpenSSL, or ZLib libraries then you get a version compiled against them. If you have updated versions installed on your system you may or may not get the updated behaviour.
By not compiling Ruby yourself you accept whatever compilation options were given by the package creator. For example, the package manager may or may not have compiled Ruby with --enable-shared. When you compile Ruby yourself you can enable whatever features you need, and even change your mind and enable some later.
Binary installs of Ruby, including any one-click type installers, result in a Ruby that is unable to easily create extensions using your local compiler tool chain. For those of you who don't understand this suffice it to say Ruby has a mechanism for extending itself whereby it uses whatever tools were used to create it in order to create extensions to itself. All the settings used to create Ruby are stored in a file called 'rbconfig.rb' which is installed with any Ruby distribution. You can easily view the contents of this file with commands like these:
#
# dump the config
#
ruby -r yaml -r rbconfig -e' y Config::CONFIG '
#
# show which compiler was used to build ruby
#
ruby -r yaml -r rbconfig -e' y Config::CONFIG["CC"] '
#
# show which arch ruby was built for
#
ruby -r yaml -r rbconfig -e' y Config::CONFIG["arch"] '
Ruby is an Open Source project. As such, a call has been put out to the community to review the Ruby source code and to submit patches for any bugs found. It's difficult or impossible to modify and test a patch to the Ruby source without having the Ruby sources local and a compiler tool chain capable of compiling them. You may not consider yourself the kind of person who might contribute to Ruby itself, but this point applies all the more to Ruby extensions, scientific and otherwise, where the author is someone just like yourself donating their time and energy to provide the community with tools to code better and more powerful Ruby. These people may be domain experts but not Ruby or C gurus - it's quite possible to contribute valuable patches, bug reports, tutorials, documentation, and good ol' comments to these projects. Unless the project happens to be pure Ruby, however, you'll be unable to join the club without a compiler and the know-how to use it.
Although the docs for Ruby (http://ruby-doc.org) are better than those of many languages, they are not perfect. Sometimes one needs to review the sources for Ruby itself to answer deep questions. Although it's quite easy to install Ruby via a package manager and then download the sources, chances are the sources you download will differ slightly from the ones the package maintainer used when she compiled your Ruby.
If you learn to compile Ruby you'll know how to install it on any platform: Mac, Linux, and even MS systems. In addition, you'll know how to compile OCaml, Perl, and Python. You'll also know how to compile third party libraries such as the GNU Scientific Library, SQLite, and ImageMagick. It's a one time effort with a huge payoff across all the platforms you'll likely encounter in the near future - something the effort of learning a package manager can never return.
Compiling is accurate and repeatable. Many of the above points suggest something slightly more fundamental about building Ruby from scratch - the process is controlled in a way that provides the most accurate Ruby in a way that can be repeated by someone else. If you are, in fact, doing science with Ruby you should be able to appreciate that knowing exactly how Ruby was built, and that it was built specifically for your hardware, is important from these perspectives.
You may be surprised to hear that compiling Ruby, or any Open Source package, can be fun. If you've ever taken apart a watch, a bicycle, or a car and found enjoyment learning the inside and out of a thing - you'll appreciate building packages from source in much the same way.
Lastly, let us assure you, it's not hard! Compiling Ruby on my system from scratch consists of a quick download, typing two commands, getting up for a cup of coffee, and returning a few minutes later to a shiny new Ruby. It's really not hard stuff.
What?
In order to streamline this tutorial we'll concern ourselves only with POSIX (http://en.wikipedia.org/wiki/POSIX/) like operating systems with a reasonable subset of GNU (http://gnu.org/) like tools. If you don't want to bother reading the links or don't care what POSIX and GNU are, suffice it to say this covers nearly every modern computing environment, including Mac and Linux, with the notable exception of MS Windows systems. However, nearly all the following material will apply even there if one simply installs the MSYS (http://www.mingw.org/) compiler tool chain for Windows. This is by far the easiest way to set up an extensible Ruby on MS Windows, as it gives you a "minimal system" in which to configure, compile, and install software.
At this point we're going to take a step back in order to describe a few of the central tools and concepts you'll need to understand if you are to become a compilation guru.
The Compiler
There are essentially two ways to execute computer programs: interpreting or compiling. Interpreters for languages like Ruby are simply compiled programs themselves that read the text of a program and execute its instructions - in short, the text reads "jump" and the interpreter executes "jump." A compiled program, by contrast, is a binary instruction set specific to a particular machine's hardware. These kinds of programs are typically created by writing in a language like C and then using a "compiler" to transform the program's text into machine specific binary. In fact, this process often takes the intermediate step of first transforming the relatively higher level language into "assembler", the lowest human readable form of computer programming, and then transforming the resulting assembly code into binary. We can sketch out the entire process with a short example:
Say we have a very simple C program:
~ > cat a.c
#include <stdio.h>
int main(void) { printf ("hello world\n"); }
We can generate the assembler for this program using
~ > gcc -S a.c
~ > cat a.s
.file "a.c"
.section .rodata
.LC0:
.string "hello world\n"
.text
.globl main
.type main, @function
main:
pushl %ebp
movl %esp, %ebp
subl $8, %esp
andl $-16, %esp
movl $0, %eax
addl $15, %eax
addl $15, %eax
shrl $4, %eax
sall $4, %eax
subl %eax, %esp
subl $12, %esp
pushl $.LC0
call printf
addl $16, %esp
leave
ret
.size main, .-main
.section .note.GNU-stack,"",@progbits
.ident "GCC: (GNU) 3.4.4 20050721 (Red Hat 3.4.4-2)"
This doesn't look too much like our tiny C program, but without knowing a thing about assembler we can say that it's certainly possible that this program prints out "hello world." We can also say that we're glad we don't write in assembler very often any more!
This assembler code can then be transformed into a binary program using
~ > gcc a.s
and then run.
~ > a.out
hello world
Note that on some systems the resulting binary may be called "a.exe" or similar.
The compiler takes about one billion options, none of which we are going to get into in this tutorial. The only important points to take away here are that compilers turn a program's text into assembler, and then turn that assembler into machine specific binary code. In order to compile Ruby and other packages this is pretty much all you need to know.
The Linker
The most important beast in the compilation lifecycle is the linker. The linker is responsible for putting pieces of programs together to make a whole program or library. This is most easily illustrated with an example. Say we have this library
~ > cat liba.c
#include <stdio.h>
void say_hello(void){ printf("hello world\n"); }
and this main program
~ > cat a.c
void say_hello(void);
int main(void){ say_hello(); }
then they can be compiled and put together like so
#
# build the object file liba.o
#
~ > gcc -c liba.c
#
# 'link' it together with a.c
#
~ > gcc a.c liba.o
#
# run the resulting binary program
#
~ > a.out
hello world
Let's review that. We defined a routine, "say_hello", in one file and used it in another. The way we did this is by compiling liba.c into what's known as an object file. An object file is binary output just like everything else from the compiler. The crucial difference is that there is no "main", or top-level routine, and that we've used the '-c' switch to tell the compiler to stop after compiling rather than go on to link an executable. Thus, rather than creating an executable program we've simply created an intermediate binary package of routines that can be re-used by combining them with other binary files. If this were all there were to linking we could stop there; however, there is a problem with this kind of "static" linking. The problem is that each program that "links" in routines in this way gets its own copy of the routine. That is to say, if 42 programs all use the say_hello() routine, each of them will carry its own copy of that code on disk and load its own copy into memory.
Enter shared libraries. Shared libraries work like this:
#
# build the shared object file liba.so
#
~ > gcc -shared -fPIC -o liba.so liba.c
#
# 'link' it together with a.c
#
~ > gcc a.c liba.so
#
# run the resulting binary program
#
~ > a.out
a.out: error while loading shared libraries: liba.so: cannot open shared object file: No such file or directory
Oops. That didn't work did it? Here's the first thing you need to know about shared libraries:
They are found at run time
Remember, shared means just that: all programs share one copy of the library, and a copy may already be in memory when a program starts. Because of this, loading is deferred until run time, when the linker can fulfill the outstanding requirements of every running program for a particular library by simply loading it once. This has several implications:
A mechanism is required for the linker to search for libraries. This must work recursively since libraries may require libraries which may require further libraries, and so on.
Binaries built with shared libraries will be much smaller than binaries built with static ones. This may not seem like a big deal until you consider that nearly every program running on your computer has libc.so linked into it!
Binaries built with shared libraries might behave differently from one invocation of the program to the next. That is to say, changing a library may affect many programs, since the next time they run they may well pick up the new copy of the library. A versioning system is required so this kind of behaviour can be controlled: sometimes we want to pick up changes, such as in the case of a bug fix, and sometimes we do not, such as in the case of a major code change that changes the parameter order of a function call or makes some other interface change.
Let's go over each of these. Before doing that we'll mention that most of what we'll say about shared libraries applies to DLLs on Windows too; after all, they are essentially the same thing. The biggest difference is simply how the operating system manages them.
How does the linker find libraries? The definitive answer can be found by running
#
# always rtfm. carefully. all of it.
#
~ > man ld.so
~ > man ldd
~ > man ldconfig
on a Linux-like system. Basically, libraries are looked for in
Well known places like say, /usr/lib/.
Other places. These are normally listed in something like /etc/ld.so.conf. Like most unix things this is simply a short text file, in this case a list of extra places to look.
In places specified by the environment. The main two ENV vars affecting this are LD_LIBRARY_PATH and LD_RUN_PATH. We'll have more to say about each of these shortly.
This ought to give us a hint as to why our program was failing: it wasn't being found! Here's a fix:
#
# set LD_LIBRARY_PATH to the current directory and run a.out
#
~ > LD_LIBRARY_PATH=`pwd` a.out
hello world
Here we just told the linker, via the environment var LD_LIBRARY_PATH, to include the current directory in its search for whatever library a.out was looking for. It's worth pointing out that, here, we actually knew which library the program needed. Sometimes that's not so easy. Fortunately we have a command which does that for us:
#
# show which libraries a.out requires, and where they would be found if we ran the program
#
~ > ldd a.out
linux-gate.so.1 => (0xb7f58000)
liba.so => not found
libc.so.6 => /lib/tls/libc.so.6 (0x0069b000)
/lib/ld-linux.so.2 (0x00681000)
Well. That's pretty obvious output isn't it? Here's another run, this time with LD_LIBRARY_PATH set:
#
# show which libraries a.out requires, and where they would be found if we ran the program
#
~ > LD_LIBRARY_PATH=`pwd` ldd a.out
linux-gate.so.1 => (0xb7fa8000)
liba.so => /home/ahoward/sciruby/liba.so (0xb7fa5000)
libc.so.6 => /lib/tls/libc.so.6 (0x0069b000)
/lib/ld-linux.so.2 (0x00681000)
Hopefully the function of LD_LIBRARY_PATH is clear: it's like PATH, but for shared libraries instead of executables.
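The PATH analogy can be made concrete with a few lines of shell. The function below is purely a toy model, not what the real run-time linker does: it ignores /etc/ld.so.conf, the cache, and anything baked into the binary. The name find_lib is made up for this illustration.

```shell
# find_lib NAME SEARCH_PATH
# Print the first NAME found in the colon-separated SEARCH_PATH,
# using the same first-match-wins rule the shell applies to PATH.
# (find_lib is a hypothetical helper, not a real system command.)
find_lib() {
    name=$1
    old_ifs=$IFS
    IFS=:                       # split the search path on colons
    for dir in $2; do
        if [ -f "$dir/$name" ]; then
            IFS=$old_ifs
            printf '%s\n' "$dir/$name"
            return 0
        fi
    done
    IFS=$old_ifs
    echo "$name => not found" >&2
    return 1
}
```

Calling `find_lib liba.so "$LD_LIBRARY_PATH"` would then report which copy of liba.so this simplified search would pick, or complain on stderr if no directory contains one.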
Now this is important: if you compile code against libraries that are not in the normal places the linker looks for libraries you will need to have LD_LIBRARY_PATH set when the code is run! Many programs, like firefox, accomplish this by making users run the program through a shell script:
#
# show what the firefox program is
#
~ > file `which firefox`
/usr/bin/firefox: Bourne shell script text executable
#
# see what it's doing with LD_LIBRARY_PATH
#
~ > grep LD_LIBRARY_PATH `which firefox`
## Set LD_LIBRARY_PATH
if [ "$LD_LIBRARY_PATH" ]
LD_LIBRARY_PATH=$MOZ_DIST_BIN:$MOZ_DIST_BIN/plugins:$LD_LIBRARY_PATH
LD_LIBRARY_PATH=$MOZ_DIST_BIN:$MOZ_DIST_BIN/plugins
export LD_LIBRARY_PATH
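The grep output above only shows fragments of the wrapper. A stripped-down sketch of the same idea might look like the following; the install location /opt/myapp and the binary name myapp-bin are invented for the example and are not taken from the firefox script.

```shell
# prepend_lib_dir DIR
# Put DIR at the front of LD_LIBRARY_PATH, keeping whatever was
# already there, then export the result for child processes.
# (prepend_lib_dir is a hypothetical helper name.)
prepend_lib_dir() {
    if [ "$LD_LIBRARY_PATH" ]; then
        LD_LIBRARY_PATH=$1:$LD_LIBRARY_PATH
    else
        LD_LIBRARY_PATH=$1
    fi
    export LD_LIBRARY_PATH
}

# A wrapper script would then end with something like this
# (both paths are made up):
#   prepend_lib_dir /opt/myapp/lib
#   exec /opt/myapp/bin/myapp-bin "$@"
```

The if/else avoids leaving a trailing colon when the variable starts out empty, which is the same reason the firefox script has two separate assignments.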
Personally, I think that's kind of weak. It would be nice if things just worked wouldn't it? It can:
#
# compile with LD_RUN_PATH set
#
~ > LD_RUN_PATH=`pwd` gcc a.c liba.so
#
# see where ldd thinks it'll find liba.so when a.out is run
#
~ > ldd a.out
linux-gate.so.1 => (0xb7f6c000)
liba.so => /home/ahoward/sciruby/liba.so (0xb7f69000)
libc.so.6 => /lib/tls/libc.so.6 (0x0069b000)
/lib/ld-linux.so.2 (0x00681000)
#
# note the code runs, even without LD_LIBRARY_PATH set!
#
~ > a.out
hello world
Voilà. This is what LD_RUN_PATH does. It's similar to LD_LIBRARY_PATH, only the linker applies it when the program is built, not when it is run. How does it do this? It simply encodes the directories listed in LD_RUN_PATH into the program itself for later reference. Basically it stores a hint for itself. Whether a user's LD_LIBRARY_PATH overrides that hint depends on how the linker recorded it: an entry stored as RUNPATH is searched after LD_LIBRARY_PATH, so the environment wins, while an entry stored as the older RPATH is searched first (see man ld.so). The beauty of LD_RUN_PATH really comes into play when a big complicated compile links in 20 libraries, none of which is installed in a standard location - with LD_RUN_PATH the locations can be encoded into the binary and all dependencies will automatically be found.
As mentioned above, shared libraries result in much smaller binaries. You may have noticed that the a.out program has shared library dependencies whether or not it was explicitly compiled against any shared libs:
#
# build a.out using a shared lib
#
~ > LD_RUN_PATH=`pwd` gcc a.c liba.so
#
# view its shared library dependencies
#
~ > ldd a.out
linux-gate.so.1 => (0xb7f38000)
liba.so => /home/ahoward/sciruby/liba.so (0xb7f35000)
libc.so.6 => /lib/tls/libc.so.6 (0x0069b000)
/lib/ld-linux.so.2 (0x00681000)
#
# build a.out using a static lib
#
~ > gcc a.c liba.o
#
# view its shared library dependencies
#
~ > ldd a.out
linux-gate.so.1 => (0xb7f94000)
libc.so.6 => /lib/tls/libc.so.6 (0x0069b000)
/lib/ld-linux.so.2 (0x00681000)
Shared libraries are such an important feature of modern operating systems that there is no way around them: they are the default for nearly all system functionality. Imagine if every program on your computer had its own copies of linux-gate.so.1, libc.so.6, and /lib/ld-linux.so.2! Every program on your computer is going to have at least a few shared library dependencies like this. Incidentally, a really great way to hork your entire system is to introduce a bug into libc.so - you can see why!
Note the versioning strings in the libraries above. I'm not going to get too deep into how the linker uses versioning to keep things sane, since that's already spelled out in detail in the links at the end of this tutorial. Suffice it to say the concept deals with interfaces and implementations. When you link against code that supports a certain interface, for example a particular function signature, you don't want your code to stop running when that shared library gets updated. The linker and shared library versioning solve this by supporting the concept of linking against some library that will support the functionality we desire. How does it do this? It looks for a library with the same name as the one we linked against, but it uses the newest one that supports the interface we required at compile time - the implementation is allowed to change, and that's how bug fixes get seamlessly integrated into running systems when shared libraries are used. What we do not want is for the linker to pull in a library with a completely different interface! The links at the end explain quite well how this is done in practice, but here's an extremely simplified example. Say you compiled a program using:
fortytwo :~/sciruby > cat liba.c
#include <stdio.h>
void say_hello(void){ printf("hello world\n"); }
fortytwo :~/sciruby > gcc -shared -fPIC liba.c -o liba.so.1.0
fortytwo :~/sciruby > LD_RUN_PATH=`pwd` gcc a.c liba.so.1.0
fortytwo :~/sciruby > ldd a.out
linux-gate.so.1 => (0xb7f7d000)
liba.so.1.0 => /home/ahoward/sciruby/liba.so.1.0 (0xb7f7a000)
libc.so.6 => /lib/tls/libc.so.6 (0x0069b000)
/lib/ld-linux.so.2 (0x00681000)
fortytwo :~/sciruby > a.out
hello world
The Shell
... in progress
Configure
... in progress
Make
... in progress
Links
http://tldp.org/HOWTO/Program-Library-HOWTO/shared-libraries.html
http://www-128.ibm.com/developerworks/linux/library/l-shlibs.html