The following is an email I got from Rick Nooner about when to use high-level languages like Ruby:
Subject: Thoughts on performance and garbage collected languages
Ara,
Here are some thoughts about what we talked about last Monday. These
thoughts are not new and have been expressed in various forms
throughout my career by many different people.
Why use a garbage collected language, i.e. why use a higher level
language?
The short answer is because it helps get the job done quicker with
fewer errors.
Truth of the matter is that we all write about the same number of
lines of code with approximately the same error ratio no matter what
language we write in. To become more productive, we have to increase
the amount of work done per line of code and make errors easier to find.
Higher level languages have more functionality per line of code than
lower level languages. The tradeoff is execution speed. Most
programs are not CPU bound so this isn't an issue. For those
programs that are CPU bound, there are various ways to still leverage
the benefits of a higher level language while optimizing the
execution speed to an acceptable level.
First, we must be wary of premature optimization. When execution
speed is a problem, rather than taking stabs in the dark, the
application should be profiled so that we can accurately understand
where time is spent. Only then can we make intelligent decisions
regarding our optimization strategies. Most of the time, appropriate
choices of algorithms will solve the problem.
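
As an illustration of profiling before optimizing, here is a minimal
Python sketch using the standard library's cProfile and pstats modules.
The functions slow_lookup, fast_lookup and profile are hypothetical
names invented for this example; they are not from any application
described in this email.

```python
import cProfile
import io
import pstats


def slow_lookup(items, targets):
    """Hypothetical hot spot: a linear scan of a list for every target."""
    return sum(1 for t in targets if t in items)


def fast_lookup(items, targets):
    """Same result with a better data structure: average O(1) set lookups."""
    item_set = set(items)
    return sum(1 for t in targets if t in item_set)


def profile(func, *args):
    """Run func under cProfile; return (result, report sorted by cumulative time)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(5)  # show the top 5 entries
    return result, buffer.getvalue()


if __name__ == "__main__":
    items = list(range(2000))
    targets = list(range(0, 4000, 2))
    for func in (slow_lookup, fast_lookup):
        result, report = profile(func, items, targets)
        print(func.__name__, "found", result)
        print(report)
```

Both functions return the same count; the profile report is what shows
where the time goes, and in this sketch the fix is a change of data
structure rather than a change of language.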
For example, several years ago we had an application that performed
routing optimization for lambdas on our fiber optic backbone across
the US and Europe using Dijkstra's least cost routing algorithm with
data stored in an Oracle database. This application was written in
Java and it took hours to find the correct routes. Routes needed to
be found within 5 seconds or less. My team was asked to come in and
fix the problem by rewriting the application in C++.
However, after looking at the application, it became clear that it
was really database bound and that simply rewriting it in C++
wouldn't solve the performance problem. We still rewrote the
application, but in Java (for various reasons our only choices were
C++ or Java), and we completely changed the architecture.
Instead of hitting the database for each query, we built a network
graph in memory from the database when the application was started
and rebuilt the graph using notification triggers from the database
when changes were made. This resulted in over a 36X speed increase,
easily returning routes within the needed time frame.
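
The idea of loading the graph once and routing entirely in memory can
be sketched as follows. This is a Python illustration, not the original
Java application: the row format (source, destination, cost), the
function names, the city codes in the usage note, and the assumption
that links are bidirectional are all made up for the example.

```python
import heapq


def build_graph(rows):
    """Build an in-memory adjacency map from (source, destination, cost) rows.

    In the setup described above the rows would be read from the database
    once at startup; here they are simply passed in as a list.
    """
    graph = {}
    for src, dst, cost in rows:
        graph.setdefault(src, []).append((dst, cost))
        graph.setdefault(dst, []).append((src, cost))  # assumed bidirectional
    return graph


def least_cost_route(graph, start, goal):
    """Dijkstra's algorithm: return (total_cost, path), or (None, []) if unreachable."""
    queue = [(0, start, [start])]  # (cost so far, node, path taken)
    settled = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in settled:
            continue
        settled.add(node)
        for neighbor, link_cost in graph.get(node, []):
            if neighbor not in settled:
                heapq.heappush(
                    queue, (cost + link_cost, neighbor, path + [neighbor])
                )
    return None, []
```

With the hypothetical rows ("DEN", "CHI", 10), ("CHI", "NYC", 8),
("DEN", "DAL", 7) and ("DAL", "NYC", 15), a call to
least_cost_route(graph, "DEN", "NYC") returns (18, ["DEN", "CHI", "NYC"]).
No query touches the database once the graph has been built.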
Architecture and algorithms are the most important aspects of a design.
Now let's talk about process vs. thread parallelism. Java is the
only garbage collected language that I am aware of that supports
native threads with very little overhead. Python supports native
threads but the global interpreter lock significantly hinders their
usefulness. Ruby doesn't have native threads although its threading
model is useful in other ways.
The Java application that I mentioned double buffered the network
graph. Worker threads looked for routes while a separate builder
thread was responsible for maintaining an updated network graph.
Graphs were built in the background on a separate CPU and swapped in
when finished. This allowed route finding to run continuously. This
is possible because threads share the same address space. This would
be a much more difficult problem using process level parallelism.
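
The double-buffering pattern can be sketched in a few lines. This is a
Python illustration of the pattern only, not the original Java code;
the class and function names are hypothetical. Because of the global
interpreter lock mentioned above, this version shows the structure of
the swap rather than true parallel speedup.

```python
import threading


class DoubleBufferedGraph:
    """Holds the graph that readers use while a builder prepares the next one."""

    def __init__(self, initial_graph):
        self._active = initial_graph
        self._lock = threading.Lock()

    def current(self):
        """Return the live graph; readers treat it as read-only."""
        with self._lock:
            return self._active

    def swap_in(self, new_graph):
        """Publish a fully built graph in a single step."""
        with self._lock:
            self._active = new_graph


def rebuild(buffered, rows):
    """Builder thread body: build a new graph off to the side, then swap it in."""
    new_graph = {}
    for src, dst, cost in rows:
        new_graph.setdefault(src, {})[dst] = cost
    buffered.swap_in(new_graph)


if __name__ == "__main__":
    buffered = DoubleBufferedGraph({"A": {"B": 1}})
    builder = threading.Thread(
        target=rebuild, args=(buffered, [("A", "B", 5), ("B", "C", 2)])
    )
    builder.start()
    builder.join()
    print(buffered.current())
```

A worker that called current() before the swap keeps using the old
graph until it finishes its route, which is why the swap never
interrupts route finding. It works only because both threads can see
the same objects in one address space.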
Process level parallelism was also used in this application to
provide horizontal scaling. As load increased, additional processing
elements could be added in the form of additional machines and the
work control daemon would load balance the route find requests across
all available route finding processes, wherever they might live.
Another important advantage of a garbage collected language is the
ability to quickly get something done allowing for a more iterative
style of development. This leads to having a working example early
on that can easily be iteratively modified until it provides exactly
the functionality needed.
When I first came to XXXXX, I was asked to build a system to
collect performance and usage metrics from all of our network gear
via SNMP, both for capacity and planning purposes and for a new usage
based billing system. We had tens of thousands of switches and
routers in our network spread across the US, Europe and Asia. I had
to collect samples from each device every 5 minutes.
I was hired because I said that I could do this job in a couple of
months. Everyone else claimed that it would take a year or more to
do. My secret weapon was Python. There were no SNMP bindings at the
time for Python but it was easy to add that capability via the CMU
SNMP libraries. I finished the first working version in less than a
month.
This application used distributed collection nodes that lived in each
major city in our network. In the beginning, this was 36 nodes.
They, in turn, fed data to a large 12-processor central collecting
node that preprocessed the data before sending it to a data warehouse
running on a large 56-processor Sun E-10K.
I did this by myself in 1/12 the time that anyone else thought
possible. It could not have been done in this timeframe without a
garbage collected language like Python. Ruby would have worked just
as well but I didn't know anything about it at the time.
This story is repeated over and over.
The next project was a Solaris/Linux performance monitoring system
that gathered over 140 different parameters from the OS every 10
seconds. It would seem this is not the right place to use a
"scripting" language, right? Wrong. With judicious use of C
libraries, the higher level language can be used for most of the
logic (read: the majority of the code) and the small amount of
performance sensitive code can be in C. This still gives a
tremendous advantage in development speed and maintainability in the
future.
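
As a small illustration of calling into a C library from a higher
level language, here is a sketch using Python's standard ctypes module
to call strlen from the C standard library. It assumes a Linux or macOS
system where the C library can be loaded this way, and it is not the
kstats interface used in the application described here; the wrapper
name c_strlen is hypothetical.

```python
import ctypes
import ctypes.util

# Locate and load the C standard library (assumes a typical Linux/macOS setup).
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature: size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_strlen(text):
    """Return the byte length of text's UTF-8 encoding, computed by C's strlen."""
    return libc.strlen(text.encode("utf-8"))


if __name__ == "__main__":
    print(c_strlen("kstat"))
```

The logic stays in the high level language; only the call that needs C
crosses the boundary.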
This application was written in Python and interfaced with Solaris
using the kstats library. It collects data every 10 seconds, rolls
it up every 5 minutes, makes it available for off machine collection
via the network and uses less than 2% of the CPU.
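
The collect-then-roll-up step can be sketched as below. This is an
assumed shape for the data, (timestamp in seconds, value) pairs, and an
assumed summary of count, average and maximum; the email does not say
which statistics the real tool kept, and the names are hypothetical.

```python
from collections import defaultdict
from statistics import mean

ROLLUP_SECONDS = 300  # five-minute buckets; samples arrive every 10 seconds


def roll_up(samples, bucket_seconds=ROLLUP_SECONDS):
    """Summarize (timestamp, value) samples into fixed-width time buckets.

    Returns a dict mapping each bucket's start time to a summary with the
    sample count, average and maximum for that bucket.
    """
    buckets = defaultdict(list)
    for timestamp, value in samples:
        bucket_start = timestamp - (timestamp % bucket_seconds)
        buckets[bucket_start].append(value)
    return {
        start: {"count": len(values), "avg": mean(values), "max": max(values)}
        for start, values in sorted(buckets.items())
    }
```

At one sample every 10 seconds, each five-minute bucket holds 30
samples per parameter, so the rolled-up data is a small fraction of the
raw stream.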
Using this tool, we found that the 2000+ servers that we
were monitoring used less than 20% of their CPU on average. In other
words, CPU is rarely an issue when deciding what language to use for
a project.
We wrote this application internally, even though commercial
alternatives existed, because we were quoted over $20 million
for an equivalent system by more than one vendor.
The second part of this story is the central collection system that
collects the data from each host in the network and does long term
trending and analysis.
Our original collection and analysis system used SAS, which costs
roughly $300,000/year the way we were using it. I rewrote the entire
collection and analysis system in Ruby during a six month period,
saving the $300,000/year and increasing performance by an order of
magnitude. The Ruby based system also uses Postgres to store the
data it collects and analyzes.
Today we are using Ruby for VoIP test automation.
BTW, none of the applications that I mentioned besides the first
could have been done successfully in Java because of the memory
requirements of the Java interpreter vs. resources available on the
machines the applications would run on. However, both Python and
Ruby would work fine.
When is a lower level language necessary? Only when performance
dictates. Even then, most of the program can be written in Ruby/
Python/Java, etc. with only the performance sensitive pieces in C/C++/
Fortran/asm.
In the past, I've worked writing video games for Virgin/Mastertronic
and writing real-time data collection and security systems for
nuclear power plants.
While writing video games, I witnessed the transition from assembly
language to C. In the nuclear industry, I saw the transition from
special purpose, real time operating systems to VMS, QNX and Unix and
from assembly and Fortran to C then C++. Each transition raised the
same questions about performance, etc. that you are asking now about
garbage collected languages.
I hope this is useful.
Rick
--
Rick Nooner
rick@nooner.net