
A memcached-like server in Ruby - feasible?

17 Message(s) by 8 Author(s) originally posted in ruby programming


From: Tom Machinski Date:   Saturday, October 27, 2007
Hi group,

I'm running a very high-load website done in Rails.

The number and duration of queries per-page is killing us. So we're
thinking of using a caching layer like memcached. Except we'd like
something more sophisticated than memcached.

Allow me to explain.

memcached is like an object, with a very limited API: basically
#get_value_by_key and #set_value_by_key.

One thing we need, that is not supported by memcached, is to be able to
store a large set of very large objects, and then retrieve only a few
of them by certain parameters. For example, we may want to store 100K
Foo instances, and retrieve only the first 20 - sorted by their
#created_on attribute - whose #bar attribute equals 23.

We could store all those 100K Foo instances normally on the memcached
server, and let the Rails process retrieve them on each request. Then
the process could perform the filtering itself. Problem is that it's
very suboptimal, because we'd have to transfer a lot of data to each
process on each request, and very little of that data is actually
needed after the processing. I.e. we'd pass 100K large objects,
while the process only really needs 20 of them.

Ideally, we could call:

memcached_improved.fetch_newest( :attributes => { :bar => 23 }, :limit => 20 )

and have the memcached_improved server filter and return only the
required 20 objects by itself.
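
A minimal sketch of the filtering such a call implies, done in-process rather than over the network. Foo, FooStore and the keyword names are hypothetical and only mirror the call above; this is not an existing library, and a real server would also need a wire protocol and concurrency handling.

```ruby
# Hypothetical in-memory store; the names are illustrative only.
Foo = Struct.new(:id, :bar, :created_on)

class FooStore
  def initialize
    @items = []
  end

  def add(foo)
    @items << foo
    self
  end

  # Return the newest `limit` items whose attributes match the given hash.
  def fetch_newest(attributes: {}, limit: 20)
    matching = @items.select do |item|
      attributes.all? { |name, value| item[name] == value }
    end
    matching.sort_by(&:created_on).reverse.first(limit)
  end
end

store = FooStore.new
store.add(Foo.new(1, 23, Time.at(100)))
store.add(Foo.new(2, 7,  Time.at(200)))
store.add(Foo.new(3, 23, Time.at(300)))

newest = store.fetch_newest(:attributes => { :bar => 23 }, :limit => 20)
puts newest.map(&:id).inspect
```

Only the matching objects leave the store, which is the point of the proposal: the 100K objects stay on the server side and the caller receives at most `limit` of them.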

Now the question is:

How expensive would it be to write memcached_improved?

On the surface, this might seem easy to do with something like
Daemons[1] in Ruby (as most of our programmers are Rubyists). Just
write a simple class, have it run a TCP server and respond to
requests. Yet I'm sure it's not that simple, otherwise memcached would
have been trivial to write. There are probably stability issues for
multiple concurrent clients, multiple simultaneous read/write requests
(race conditions etc.) and heavy loads.

So, what do you think:

1) How would you approach the development of memcached_improved?

2) Is this task doable in Ruby? Or maybe only a Ruby + X combination
(X probably being C)?

3) How much time / effort / people / expertise should such a task
require? Is it feasible for a smallish team (~4 programmers) to put
together as a side-project over a couple of weeks?

Thanks,
-Tom
--
[1] http://daemons.rubyforge.org/


From: Lionel Bouton Date:   Saturday, October 27, 2007
wrote in message:
Hi group,
I'm running a very high-load website done in Rails.
The number and duration of queries per-page is killing us. So we're
thinking of using a caching layer like memcached. Except we'd like
something more sophisticated than memcached.
Allow me to explain.
memcached is like an object, with a very limited API: basically
#get_value_by_key and #set_value_by_key.
One thing we need, that is not supported by memcached, is to be able to
store a large set of very large objects, and then retrieve only a few
of them by certain parameters. For example, we may want to store 100K
Foo instances, and retrieve only the first 20 - sorted by their
#created_on attribute - whose #bar attribute equal 23.



It looks like the job a database would do for you. Retrieving 20 large
objects with such conditions should be a piece of cake for any properly
tuned database. Did you try this with PostgreSQL or MySQL with indexes
on created_on and bar? How much memory did you give your database to
play with? If the size of the objects is so bad it takes too much time
to extract from the DB (or the traffic is too much for the DB to use its
own disk cache efficiently) you could only retrieve the ids in the first
pass with hand-crafted SQL and then fetch the whole objects using
memcache (and only go to the DB if memcache does not have the object you
are looking for).
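
A sketch of the two-pass lookup described above, under loud assumptions: plain Ruby hashes stand in for both the database and memcache so the example runs on its own, and the SQL in the comment (table `foos`, columns `bar` and `created_on`) is illustrative rather than taken from a real schema.

```ruby
# Stand-ins so the sketch is self-contained: DB plays the database,
# CACHE plays memcache. Both are ordinary hashes here.
DB = {
  1 => { id: 1, bar: 23, created_on: 100, payload: "a" },
  2 => { id: 2, bar: 7,  created_on: 200, payload: "b" },
  3 => { id: 3, bar: 23, created_on: 300, payload: "c" }
}
CACHE = {}

# Pass 1: a cheap query that returns ids only, roughly
#   SELECT id FROM foos WHERE bar = ? ORDER BY created_on DESC LIMIT ?
def newest_ids(bar, limit)
  rows = DB.values.select { |row| row[:bar] == bar }
  rows.sort_by { |row| -row[:created_on] }.first(limit).map { |row| row[:id] }
end

# Pass 2: full objects come from the cache; only misses go to the DB.
def fetch_objects(ids)
  ids.map { |id| CACHE[id] ||= DB.fetch(id) }
end

ids     = newest_ids(23, 20)
objects = fetch_objects(ids)
puts objects.map { |o| o[:payload] }.inspect
```

After the first request the large objects are served from the cache, so the database only has to answer the narrow id query.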

Lionel.


From: Tom Machinski Date:   Saturday, October 27, 2007
wrote in message:
It looks like the job a database would do for you. Retrieving 20 large
objects with such conditions should be a piece of cake for any properly
tuned database. Did you try this with PostgreSQL or MySQL with indexes
on created_on and bar?



Yes, I'm using MySQL 5, and all query columns are indexed.

How much memory did you give your database to
play with?



Not sure right now, I will ask my admin and reply.

If the size of the objects is so bad it takes too much time
to extract from the DB (or the traffic is too much for the DB to use its
own disk cache efficiently) you could only retrieve the ids in the first
pass with hand-crafted SQL and then fetch the whole objects using
memcache (and only go to the DB if memcache does not have the object you
are looking for).



Might be a good idea.

Lionel.



Long term, my goal is to minimize the amount of queries that hit the
database. Some of the queries are more complex than the relatively
simple example I have given here. And I do not think I could optimize
them much beyond 0.01 secs per query.

I was hoping to alleviate with memcached_improved some of the pains
associated with database scaling, e.g. building a replicating cluster
etc. Basically what memcached does for you, except as demonstrated,
memcached by itself seems insufficient for our needs.

Thanks,
-Tom


From: ara.t.howard Date:   Sunday, October 28, 2007
wrote in message:

Hi group,
I'm running a very high-load website done in Rails.
The number and duration of queries per-page is killing us. So we're
thinking of using a caching layer like memcached. Except we'd like
something more sophisticated than memcached.
Allow me to explain.
memcached is like an object, with a very limited API: basically
#get_value_by_key and #set_value_by_key.
One thing we need, that is not supported by memcached, is to be able to
store a large set of very large objects, and then retrieve only a few
of them by certain parameters. For example, we may want to store 100K
Foo instances, and retrieve only the first 20 - sorted by their
#created_on attribute - whose #bar attribute equal 23.
We could store all those 100K Foo instances normally on the memcached
server, and let the Rails process retrieve them on each request. Then
the process could perform the filtering itself. Problem is that it's
very suboptimal, because we'd have to transfer a lot of data to each
process on each request, and very little of that data is actually
needed after the processing. I.e. we'd pass 100K large objects,
while the process only really needs 20 of them.
<snip>



i'm reading this as

- need query
- need readonly
- need sorting
- need fast
- need server

and thinking: how is this not a readonly slave database? I think that
mysql can either do this with a readonly slave *or* it cannot be done
with modest resources.

my 2cts.
a @xxxxxxxxxxx http://codeforpeople.com/
--
it isn't enough to be compassionate. you must act.
h.h. the 14th dalai lama


From: M. Edward Date:   Sunday, October 28, 2007
wrote in message:
wrote in message:
Hi group,

I'm running a very high-load website done in Rails.

The number and duration of queries per-page is killing us. So we're
thinking of using a caching layer like memcached. Except we'd like
something more sophisticated than memcached.

Allow me to explain.

memcached is like an object, with a very limited API: basically
#get_value_by_key and #set_value_by_key.

One thing we need, that is not supported by memcached, is to be able to
store a large set of very large objects, and then retrieve only a few
of them by certain parameters. For example, we may want to store 100K
Foo instances, and retrieve only the first 20 - sorted by their
#created_on attribute - whose #bar attribute equal 23.

We could store all those 100K Foo instances normally on the memcached
server, and let the Rails process retrieve them on each request. Then
the process could perform the filtering itself. Problem is that it's
very suboptimal, because we'd have to transfer a lot of data to each
process on each request, and very little of that data is actually
needed after the processing. I.e. we'd pass 100K large objects,
while the process only really needs 20 of them.
<snip>
i'm reading this as
- need query
- need readonly
- need sorting
- need fast
- need server
and thinking: how is this not a readonly slave database? I think that
mysql can either do this with a readonly slave *or* it cannot be done
with modest resources.
my 2cts.



Add "large set of very large (binary?) objects". So ... yes, at least
*one* database/server. This is exactly the sort of thing you *can* throw
hardware at. I guess I'd pick PostgreSQL over MySQL for something like
that, but unless you're a billionaire, I'd be doing it from disk and not
from RAM. RAM-based "databases" look really attractive on paper, but
they tend to look better than they really are for a lot of reasons:

1. *Good* RAM -- the kind that does not fall over in a ragged heap when
challenged with "memtest86" -- isn't inexpensive. Let's say the objects
are "very large" -- how about a typical CD length of 700 MB? OK ... too
big -- how about a three minute video highly compressed. How big are
those puppies? Let's assume a megabyte. 100K of those is 100 GB. Wanna
price 100 GB of *good* RAM? Even with compression, it does not take much
stuff to fill up a 160 GB iPod, right?

2. A good RDBMS design / query planner is amazingly intelligent, and you
can give it hints. It might take you a couple of weeks to build your
indexes but your queries will run fast afterwards.

3. RAID 10 is your friend. Mirroring preserves your data when a disk
dies, and striping makes it come into RAM quickly.

4. Enterprise-grade SANs have lots of buffering built in. And for that
stuff, you do not have to be a billionaire -- just a plain old millionaire.

"Premature optimization is the root of all evil?" Bullshit! :)


From: ara.t.howard Date:   Sunday, October 28, 2007
wrote in message:

it does not take much stuff to fill up a 160 GB iPod, right?



http://drawohara.tumblr.com/post/17471102

could not resist...

cheers.

a @xxxxxxxxxxx http://codeforpeople.com/
--
share your knowledge. it's a way to achieve immortality.
h.h. the 14th dalai lama


From: Bill Kelly Date:   Sunday, October 28, 2007
From: "ara.t.howard" <ara.t.howard@xxxxxxxxxxx>
wrote in message:
it does not take much stuff to fill up a 160 GB iPod, right?
http://drawohara.tumblr.com/post/17471102



BTW, my wife and I were only able to fit about 3/5ths of our CD
collection on our 40 gig iPod. (I rip at 320kbps mp3 admittedly.)

So while a 160 GB iPod would be slightly overkill for us, it
would not be outrageously so.

Regards,

Bill


From: Tom Machinski Date:   Sunday, October 28, 2007
wrote in message:
The other thing you can play with is using sqlite as the local (one
per app server) cache engine.



Thanks, but if I'm already caching at the local process level, I might
as well cache to in-memory Ruby objects; the entire data-set is not
that huge for a high-end server RAM capacity: about 500 MB all in all.

YS.



-Tom


From: Tom Machinski Date:   Sunday, October 28, 2007
wrote in message:
i'm reading this as
- need query
- need readonly
- need sorting
- need fast
- need server
and thinking: how is this not a readonly slave database? I think that
mysql can either do this with a readonly slave *or* it cannot be done
with modest resources.



The problem is that for a perfectly normalized database, those queries
are *heavy*.

We're using straight, direct SQL (no ActiveRecord calls) there, and
several DBAs have already looked into our query strategy. Bottom line
is that each query on the normalized database is non-trivial, and they
cannot reduce it to less than 0.2 secs / query. As we have 5+ of these
queries per page, we'd need one MySQL server for every
request-per-second we want to serve. As we need at least 50 reqs/sec,
we'd need 50 MySQL servers (and probably something similar in terms of
web servers). We cannot afford that.
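
The arithmetic behind the 50-server estimate, spelled out as a sketch. It assumes (the post does not say so) that the queries for a page run one after another and that a MySQL server works on one such query at a time; the variable names are made up for illustration.

```ruby
query_ms          = 200   # 0.2 secs per query, the floor quoted above
queries_per_page  = 5     # "5+" in the post; 5 is the optimistic case
target_reqs_per_s = 50

page_ms            = query_ms * queries_per_page   # DB time per page
pages_per_server_s = 1000.0 / page_ms              # pages one server can feed per second
servers_needed     = (target_reqs_per_s / pages_per_server_s).ceil

puts "DB time per page: #{page_ms} ms"
puts "Servers needed:   #{servers_needed}"
```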

We can only improve the queries TTC by replicating data inside the
database, i.e. de-normalizing it with internal caching at the table
level (basically, that amounts to replicating certain columns from
table `bars` in table `foos`, thus saving some very heavy JOINs).

But if we're already de-normalizing, caching and replicating data, we
might as well create another layer of de-normalized, processed data
between the database and the Rails servers. That way, we'll need
fewer MySQL servers, serve requests faster (as the layer would hold
the data in an already processed state), and save much of the
replication / clustering overhead.

-Tom


From: Tom Machinski Date:   Sunday, October 28, 2007
wrote in message:
I do not recommend you use this project, I have not used it myself for quite a
while and it has a number of issues I have not addressed. You may find it a
helpful basis implementation if you should decide to go the pure Ruby
route.



Thanks, Ian!

Would you mind sharing - here, or by linking a blog / text, or
privately if you prefer - some information about these issues?

I'm asking for two reasons:

1) To learn about possible pitfalls / complications / costs involved
in taking the pure Ruby route.

2) We may decide to adopt your project and try to address those issues
to use the patched Boogaloo in production.

Thanks,
-Tom


From: Andreas S. Date:   Sunday, October 28, 2007
wrote in message:
wrote in message:
with modest resources.
The problem is that for a perfectly normalized database, those queries
are *heavy*.
We're using straight, direct SQL (no ActiveRecord calls) there, and
several DBAs have already looked into our query strategy. Bottom line
is that each query on the normalized database is non-trivial, and they
cannot reduce it to less than 0.2 secs / query.



Try enabling the MySQL query cache. For many applications even a few MB
can work wonders.
--
Posted via http://www.ruby-forum.com/.


From: Tom Machinski Date:   Sunday, October 28, 2007
wrote in message:
Try enabling the MySQL query cache. For many applications even a few MB
can work wonders.



Thanks, that's true, and we already do that. We've a very large
cache in fact (~500 MB) and it does improve performance, though not
enough.

-Tom


From: Stanislav Sedov Date:   Sunday, October 28, 2007
On Sun, Oct 28, 2007 at 07:31:30AM +0900 Tom Machinski mentioned:
2) Is this task doable in Ruby? Or maybe only a Ruby + X combination
(X probably being C)?


I believe you can achieve high efficiency in server design by using
an event-driven design. There are some event libraries available for Ruby,
e.g. eventmachine. In this case the scalability of the server should
be comparable with the C version.

Threads will have a huge overhead in case of many clients.

BTW, the original memcached uses event-driven design too, IIRC.
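
To make the event-driven idea concrete, here is a minimal single-threaded sketch built on `IO.select` from Ruby's standard library. It is not how memcached or eventmachine are implemented; TinyEventServer and its two-command text protocol are invented for illustration, and `gets` can still block on a partial line, which a real reactor would avoid with non-blocking reads.

```ruby
require "socket"

# One loop watches the listening socket and all client sockets, and
# serves whichever are readable. No thread per client.
class TinyEventServer
  attr_reader :port

  def initialize(port = 0)
    @listener = TCPServer.new("127.0.0.1", port)  # port 0: pick a free port
    @port     = @listener.addr[1]
    @clients  = []
    @store    = {}
  end

  # One turn of the event loop.
  def tick(timeout = 0.1)
    readable, = IO.select([@listener] + @clients, nil, nil, timeout)
    return unless readable

    readable.each do |sock|
      if sock == @listener
        @clients << @listener.accept
      else
        handle(sock)
      end
    end
  end

  def run
    loop { tick }
  end

  private

  # Serve one request line: "set KEY VALUE" or "get KEY".
  def handle(sock)
    line = sock.gets
    if line.nil?                 # client hung up
      @clients.delete(sock)
      sock.close
      return
    end

    command, key, value = line.chomp.split(" ", 3)
    case command
    when "set"
      @store[key] = value.to_s
      sock.puts "STORED"
    when "get"
      sock.puts @store.fetch(key, "NOT_FOUND")
    else
      sock.puts "ERROR"
    end
  end
end

# Demo: the background thread exists only so that client and server can
# share one script; the server itself handles every client in one loop.
server = TinyEventServer.new
Thread.new { server.run }

client = TCPSocket.new("127.0.0.1", server.port)
client.puts "set greeting hello"
puts client.gets
client.puts "get greeting"
puts client.gets
client.close
```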

--
Stanislav Sedov
ST4096-RIPE


From: Tom Machinski Date:   Sunday, October 28, 2007
wrote in message:
I believe you can achieve high efficiency in server design by using
an event-driven design. There are some event libraries available for Ruby,
e.g. eventmachine. In this case the scalability of the server should
be comparable with the C version.
Threads will have a huge overhead in case of many clients.
BTW, the original memcached uses event-driven design too, IIRC.



Yes, memcached (including latest) uses libevent.

I'm not completely sure whether a production-grade server of this sort
is feasible in Ruby. Many people, both here and elsewhere, seem to
think it should be done in C for better stability / efficiency /
resource consumption.

Thanks,
-Tom


From: Tom Machinski Date:   Sunday, October 28, 2007
wrote in message:
Add "large set of very large (binary?) objects". So ... yes, at least
*one* database/server. This is exactly the sort of thing you *can* throw
hardware at. I guess I'd pick PostgreSQL over MySQL for something like
that, but unless you're a billionaire, I'd be doing it from disk and not
from RAM. RAM-based "databases" look really attractive on paper, but
they tend to look better than they really are for a lot of reasons:
1. *Good* RAM -- the kind that does not fall over in a ragged heap when
challenged with "memtest86" -- isn't inexpensive. Let's say the objects
are "very large" -- how about a typical CD length of 700 MB? OK ... too
big -- how about a three minute video highly compressed. How big are
those puppies? Let's assume a megabyte. 100K of those is 100 GB. Wanna
price 100 GB of *good* RAM? Even with compression, it does not take much
stuff to fill up a 160 GB iPod, right?



I might've impressed you with a somewhat inflated view of how large
our data-set is :-)

We have about 100K objects, occupying ~5KB per object. So all in
all, the total weight of our dataset is no more than 500MB. We might
grow to maybe twice that in the next 2 years. But that's it.

So it's very feasible to keep the entire data-set in *good* RAM for a
reasonable cost.

2. A good RDBMS design / query planner is amazingly intelligent, and you
can give it hints. It might take you a couple of weeks to build your
indexes but your queries will run fast afterwards.



Good point. Unfortunately, MySQL 5 does not appear to be able to take
hints. We have analyzed our queries and there are some strategies there we
could definitely improve by manual hinting, but alas we'd need to
switch to an RDBMS that supports those.

3. RAID 10 is your friend. Mirroring preserves your data when a disk
dies, and striping makes it come into RAM quickly.
4. Enterprise-grade SANs have lots of buffering built in. And for that
stuff, you do not have to be a billionaire -- just a plain old millionaire.



We had some bad experience with a poor SAN setup, though we might've
been victims of improper installation.

Thanks,
-Tom


From: Tom Machinski Date:   Sunday, October 28, 2007
wrote in message:
So alter memcached to accept a 'query' in the form of arbitrary ruby (or
perhaps a pre-defined ruby) that a peer-daemon is to execute over the set of
results a particular memcached node contains.



Yeah, I thought of writing a Ruby daemon that "wraps" memcached.

But then the wrapper would have to deal with all the performance
challenges that a full replacement for memcached has to deal with,
namely: handling multiple concurrent clients, multiple simultaneous
read/write requests (race conditions etc.) and heavy loads.

A naive implementation of memcached itself would be trivial to write;
memcached's real merits aren't its rather limited featureset, but
its performance, stability, and robustness - i.e., its capability to
overcome the above challenges.

The only way I could use memcached to do complex queries is by
patching memcached to accept and handle complex queries. Such a patch
won't have anything to do with Ruby itself, would probably be very
non-trivial, and will have to significantly extend memcached's
architecture. I doubt I have the time to roll out something like that.

-Tom


From: marc spitzer Date:   Sunday, October 28, 2007
wrote in message:
wrote in message:
The other thing you can play with is using sqlite as the local (one
per app server) cache engine.
Thanks, but if I'm already caching at the local process level, I might
as well cache to in-memory Ruby objects; the entire data-set is not
that huge for a high-end server RAM capacity: about 500 MB all in all.



What would happen if you used two stages of mysql databases? What I mean
is that you have your production db with all your nice clean structure for
writing new data to and as a master source for your horrible denormalized
db, then you have a job that pushes changes every N minutes to the evil
ugly read only db. It is a new step in production, but it does allow you
to stick with the same tech mix you are using now.
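
A rough sketch of what such a push job computes, with hashes standing in for the two databases. The table contents and the `bar_name` column are made up for illustration; a real job would read from the master and write to the read-only copy with SQL, run from cron or a loop sleeping N minutes between passes.

```ruby
# Normalized "master" data: each foo references a bar by id.
BARS = { 10 => { name: "alpha" }, 11 => { name: "beta" } }
FOOS = [
  { id: 1, bar_id: 10, created_on: 100 },
  { id: 2, bar_id: 11, created_on: 200 }
]

# Build denormalized rows: every foo carries a copy of the bar columns
# it needs, so reads on the second database require no JOIN.
def denormalize(foos, bars)
  foos.map do |foo|
    bar = bars.fetch(foo[:bar_id])
    foo.merge(bar_name: bar[:name])
  end
end

# One pass of the job; here it simply runs once.
read_only_rows = denormalize(FOOS, BARS)
read_only_rows.each { |row| puts row.inspect }
```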

marc
--
ms4720@xxxxxxxxxxx
SDF Public Access UNIX System - http://sdf.lonestar.org


