Ruby 3, Concurrency and the Ecosystem (kirshatrov.com)
237 points by ksec on Jan 8, 2021 | 59 comments


I have a hard time seeing how this could work out. Admittedly it's been some time since I did Ruby, but as I recall there was a shocking amount of mutability going on in the core language and throughout the ecosystem. Mutable lists, tables, and strings everywhere; even the "constants" of Ruby were mutable by default, which bit me more than once.

I'm just not convinced that good parallelism is something you can successfully bolt on to a language 25 years down the road. The languages I've seen do a really good job all embraced parallelism as an important problem from the start and had a strong policy of immutability by default and good persistent data structures. Such a clear commitment permeates the ecosystem, and you get libraries with clear parallelism stories.

Hopefully I'm wrong and this will all work out great for ruby.


Yes, like the amount of thinking and novel work Rich Hickey put into the immutable but fast data structures in Clojure, and the use of MVCC STM for transactions when you do need to share mutable state between threads - it was there from the start. I don't even pretend to fully understand half of the things going on in Clojure concurrency on a deep level, but I find it reassuring that parallelism was deeply considered from the start. And it also grew with even deeper models like core.async.

How many languages are there like this? A handful? Erlang, Scala, Haskell, Go.... Java has always been parallel but with pretty primitive concurrency tools, no?


Java started with the same concurrency primitives as Ada and Modula-3.

Java 5 got the concurrent package, with support for futures, chained computations, and the ability to create your own schedulers, only matched by .NET dataflows.

Java 8 integrated the concurrent package with parallel streams.

Recent versions are being refactored so that eventually virtual threads are surfaced across the whole runtime, so instead of not knowing whether green or native threads are being used, like in the early Java days, developers get to choose which ones to use.

On the .NET side, it naturally copied the Java primitives; then came TPL, Dataflow, async/await, PLINQ, F# agents, and F# async/await (not the same as standard async/await, being based on F# computation expressions). Additionally there is the research done with Axum, Cω, Singularity, Midori, and Immutable C#.


In Go pretty much everything is very much mutable. The part that distinguishes Go is exactly the M:N scheduler used for both evented IO and parallelisation of compute.

For _concurrency_ (what the Ruby fiber schedulers are for), what is proposed is probably quite a good idea. We can do as much hatin' as we want on JavaScript, but its concurrency story has been much more successful than Ruby's due to evented IO as well - even though everything is mutable. As far as Ractor is concerned - well, WebWorkers are kind of similar - they get a separate thread, but you can only send them "messages" which must be immutable. How the MRI team is going to tackle mutations in those messages is an interesting exercise, because Ruby is ultimately a _live runtime_ - i.e. most things can change at any time. Not because it is made for sloppy programming, but because it is made for incremental changes to the VM, like Smalltalk.


> How the MRI team is going to tackle mutations in those messages

Object.freeze, presumably.

Immutability exists as an option within Ruby’s OOP system, and has since the beginning, so it’s well-integrated[1], with other language features having been built to take it into consideration.
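For instance (a minimal illustration of today's freeze semantics, nothing Ractor-specific):

  s = "hello"
  s.freeze
  s << " world"          # => FrozenError: can't modify frozen String

  CONFIG = { retries: 3 }.freeze
  CONFIG[:retries] = 5   # => FrozenError (note: freeze is shallow;
                         #    nested objects need their own freeze)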

I would expect any[2] message sent to a Ractor to be

1. On the sending side, verified as being acyclic;

2. On the receiving side, reconstructed by doing a recursive “clone and then freeze” operation.

Since most messages wouldn’t be “deep”, this wouldn’t usually be costly; but in the case of a “deep”(ly hierarchical) message, you could probably pre-“deep-clone-and-freeze” the objects you’re going to send, and the runtime would hopefully detect this and not bother to deep-clone-and-freeze them again. (Maybe they’d add a runtime tag-bit on objects for “me and all my references — transitively! — are frozen”, that could be pre-set for any object when it’s frozen, if it sees that all its instance-variables are also already marked as transitively-frozen.)
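A hypothetical sketch of that recursive clone-and-freeze (MRI exposes something close as Ractor.make_shareable, implemented natively):

  # Hypothetical helper; assumes the object graph is acyclic,
  # per point 1 above.
  def deep_clone_and_freeze(obj)
    case obj
    when Integer, Float, Symbol, NilClass, TrueClass, FalseClass
      obj  # immediate values are already frozen
    when Array
      obj.map { |e| deep_clone_and_freeze(e) }.freeze
    when Hash
      obj.to_h { |k, v| [deep_clone_and_freeze(k), deep_clone_and_freeze(v)] }.freeze
    else
      copy = obj.clone(freeze: false)
      copy.instance_variables.each do |ivar|
        value = copy.instance_variable_get(ivar)
        copy.instance_variable_set(ivar, deep_clone_and_freeze(value))
      end
      copy.freeze
    end
  end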

Alternately, they might just lean on the Marshal module, adding the capability for Marshal to load a dumped object-graph as all-frozen; and then just use Marshal.dump on the sending side and Marshal.load(..., freeze: true) on the receiving side. This wouldn’t allow for the efficiency wins of the above, but it would have the advantage of being a single already-well-tested-and-optimized all-in-C transformation; and it would give the MRI maintainers some docs to point at (those of Marshal.dump) to indicate what’s possible to send in a Ractor message.
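A sketch of that path (and indeed, Ruby 3.1 later added exactly this freeze: keyword to Marshal.load):

  payload = Marshal.dump({ user: "ada", roles: ["admin"] })  # sending side
  msg     = Marshal.load(payload, freeze: true)              # receiving side
  msg.frozen?          # => true
  msg[:roles].frozen?  # => true - the whole graph comes back frozen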

—————

[1] Integers, Floats, and Symbols come frozen by default; and there’s a pragma you can add to source files such that their string-literals will also be created frozen. So it’s not like Ruby code isn’t “prepared” for frozen objects. Your own Ruby code deals with plenty of them!
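To illustrate:

  # frozen_string_literal: true
  # (the pragma in question, at the top of the file)

  "a literal".frozen?  # => true, thanks to the pragma
  42.frozen?           # => true; Integers are always frozen
  :sym.frozen?         # => true; so are Symbols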

[2] What if you want a mutable reference to an object in another thread? Well, think about the semantics of what you want. Presumably, you want sending a message to the object to actually first send (a recursive frozen clone) of your message back to the object’s owner-Ractor to handle. What system already does this? DRb! You can send a DRb object-proxy handle to another Ractor, and it won’t matter that it receives a recursively-frozen clone of it; it’ll still work to communicate with your own Ractor’s mutable “remote” object. Think of DRb handles as the Erlang PID-objects of Ruby.


Indeed, but I don't see as much issue with things which are freeze-able. You can already Marshal.dump cyclic structures with no issue. What seems to be the culprit at the moment is changing the method definitions / module composition of modules which participate in exchanges between Ractors. A lot of Ruby modules define constants dynamically, prepend modules to patch missing or broken functionality on existing modules, and modify constants to patch bugs. Since there is no compile step / ahead-of-time monomorphisation, it is likely that a program will keep evolving after the VM has started and all the code has been loaded. I do not see the Ractor messaging setup covering this well at the moment. Maybe Ractor messages should be their own type which is only allowed to contain marshalable objects, and inside a Ractor the VM must be able to "replay" the same code-structure mutations as the entire VM (or those mutations must be cloned into the Ractor where appropriate).

While a lot of people scream that "you should not monkeypatch" (and it is pretty much always a good idea not to), it is not always practical, given that a lot of software produced is imperfect and does need careful patching sometimes - and Ruby is great at allowing it.


Languages like Haskell, Clojure, and Erlang provide good "modern" abstractions like STM, which are great.

But even having the basic threading primitives built in from an early point would help, I think (e.g. Java, C++).

Why? Because it does not let library authors entertain the idea that they can just ignore all parallelism concerns.


Don't forget Kotlin's coroutines.


Did you even read about the implementation details here? It appears you didn’t.


Please tell me how I'm wrong instead of insinuating that I didn't read the article.

> It will likely take some efforts and at least a year of work from the community to push libraries towards less shared state.

About one year to rewrite the ecosystem to fit this new threading model. Does that strike you as particularly realistic?

Like I said, I very much wish they would succeed.


Ractors don't share everything, unlike threads.

Most objects are unshareable, so you don't need to care about the thread-safety problems caused by sharing.
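A small illustration of the Ruby 3.0 semantics (a sketch, not from the article):

  r = Ractor.new do
    msg = Ractor.receive
    msg << " (mutated in child)"   # mutates the child's private copy
  end

  s = "hello"                      # a plain String is unshareable
  r.send(s)                        # so it is deep-copied, not shared
  puts r.take                      # => "hello (mutated in child)"
  puts s                           # => "hello" - the parent's object is untouched

  Ractor.shareable?("x")           # => false
  Ractor.shareable?("x".freeze)    # => true; frozen objects may be shared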


They downplayed the actual amount of time that went into these changes and the upcoming changes. Here's the history:

Matz[1] released the first version of Ruby in Dec 1995.

DHH was a major player in getting Ruby into the global spotlight with Rails[2] in 2004. Rails got very popular as a framework for developing new applications, with Basecamp as the novel showcase that it could work well, introducing people to REST (in a flexible interpretation) as well as to ActiveRecord, whose ease of use and migrations became a model for modern web development.

Rails v3 divided the community, specifically around how and what Rails would support for the server and request-handling. This hinted at problems to come, but Rails was still strong, and many took it with a grain of salt and upgraded.

However, Twitter, which had been built on Rails, became popular, and the "fail whale" emerged as they were unable to handle all of the requests. This was not so much a problem with scaling Rails as with knowing how they could scale Rails without much greater expense; but since they had to rewrite things anyway, and there was pressure to get scaling done right, they switched to Scala and Java, since Scala was functional and fast and there was a lot of support for the JVM. Functional programming had already been making a comeback in popularity in the 2000s, because it often required a lower memory footprint and was fast, and at that point in time many teams and developers were looking into it.

Though it wasn't the first time he'd done optimization, in 2012, Matz released mruby[1][3], an embedded Ruby.

Around the same time, with functional programming having become cool, Elixir was born, and some of the Rails community left to write Ruby/Rails-ish code on the Erlang VM.

Some had been trying to slim down Rails core, so that less code would be needed to serve requests.

Tenderlove, who came from the systems-programming side of things, joined the Rails core team with a focus on optimization, did work on Rack, and eventually started working to help speed up Ruby.

For years, Matz and others had focused on speeding up and slimming down Ruby. Ruby had run on Lighttpd and Ruby on Rails could run on it also.

All of these things have been driving Ruby to get better, and now it is.

So, no, I don't think they put just a year into it. At least 9+ calendar years led to this point, and it's been 26+ calendar years since initial release. And this isn't the end of it. It's not trying to compete with or tank your favorite framework or language of choice; it's just been improving, and its team, even as good as it already was, has been improving.

P.S. - Ruby is not Rails, but not talking about the history of Rails in the scope of things would be remiss. I can't think of anything in the history of Ruby that has been bad, but certainly Rails has had its "fun". But right now, it's coming together. I also didn't mention Sinatra's influence on slimming things down, or Puppet, Chef, etc.'s contribution to the Ruby community, or Crystal, which has been a valiant effort for a compiled Ruby-like ecosystem. There is so much that happened leading to today that shaped where things are and where they are going. I'm totally psyched about this.

[1]- https://en.wikipedia.org/wiki/Yukihiro_Matsumoto

[2]- https://en.wikipedia.org/wiki/Ruby_on_Rails

[3]- https://github.com/mruby/mruby


You missed the part of the history when Ruby could have been Swift, but eventually things went sour; the creator left Apple and ended up selling his work as a mobile-app development product.

https://en.wikipedia.org/wiki/RubyMotion

http://www.rubymotion.com/


As long as the language supports the primitives needed for implementing good parallelism, I do not see why it could not be done in the form of a library. Of course the ecosystem would have to evolve to make use of it. However, it seems more and more languages get on the bus of having concurrency libraries that derive from CSP directly or indirectly, as in "we want concurrency like they have in language X".

In some languages the functionality is integrated, and in some languages people developed it as a library. Which one is the right thing to do probably depends on what you can work with in the language.

Just to name some examples that come to mind: Erlang (message passing to actors), Go (hyped goroutines), Rust (multiple frameworks; Bastion comes to mind), Guile (guile-fibers, a library, because the language offers the primitives), Pony (actors, similar to Erlang), Elixir (on top of Erlang).

I do not know Ruby well enough to tell whether or not it offers the primitives required to implement this stuff as a library. If it does, there is a higher chance of someone eventually getting it right. If it does not, it will depend on the language designers to get it right.


I don't understand bolting concurrency onto something that is interpreted. A compiler would get you a 10-100x speedup, without your users having to do the complex thing of writing safe parallel code.


A lot of web services do a little bit of work, wait on a DB query, do a little bit of work, wait on DB, etc. It’s even more true with websockets, which spend most of their time idle waiting for messages.

Good concurrency support lets you handle thousands (even millions) of these activities in a single process, which is massively more efficient than using a process each, and puts less load on other systems (DB, Splunk, Graphite, RabbitMQ…)

Green threads in particular make it possible to get these benefits without writing the code in async style. You just write regular, blocking code and the runtime / frameworks take care of yielding to the scheduler (event loop) when appropriate.
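For a concrete taste, here's a sketch using the third-party "async" gem, which ships a Ruby 3 fiber scheduler (assuming async 2.x on Ruby 3):

  require 'async'  # gem providing a Fiber.set_scheduler implementation

  Async do |task|
    100.times do
      task.async do
        sleep 1    # stands in for a blocking DB/HTTP call; the fiber
                   # scheduler parks this fiber instead of the thread
      end
    end
  end
  # Wall time is ~1 second, not ~100 - and the code stays plain, blocking Ruby.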


Ruby can’t be compiled easily due to the very dynamic semantics of the language. If you wanted to compile it, it would probably start behaving a lot more like Crystal, which looks similar but behaves completely differently.


Self and Smalltalk beg to differ where JIT compilation is concerned.


Worth noting that there has been a Ruby-on-Smalltalk-VM-technology implementation for more than ten years, but it didn't catch on:

http://maglev.github.io/


Well, "To run MagLev you’ll also need a GemStone/S Server.".

So did it run Rails?


If you're talking about just compilation then it's easy enough doing something like context-threading where you translate bytecode into a call instruction per operation to a function implementing the bytecode.

If you're talking about producing a static binary, then both RubyMotion and TruffleRuby can do this.


> If you're talking about just compilation then it's easy enough doing something like context-threading where you translate bytecode into a call instruction per operation to a function implementing the bytecode.

Ruby has already been doing threaded dispatch on bytecodes since 1.9.


Context-threading is a JIT technique. CRuby can optionally use call-threading where each op is a function and dispatch is an infinite loop of function calls. It's not the default since direct or token threading are faster but it's there.

Call-threading looks like this:

  while (1) { interpreter_state = (*virtual_program_counter++)(interpreter_state); }
There's an indirect jump here.

With context-threading you JIT very simple code which looks like this and has only direct calls:

  interpreter_state = put1(interpreter_state);
  interpreter_state = put1(interpreter_state);
  interpreter_state = plus(interpreter_state);


I should have clarified that I meant "compiled easily for 10-100x speedup" (on representative workloads) as the OP was saying.

You can definitely compile anything... whether it's worthwhile is a totally different story. I learned this the hard way :)

As far as I remember the Ruby 3x3 effort yielded 3x on SOME workloads, with tremendous effort, which is about what I would have expected due to Ruby's very dynamic semantics.


TruffleRuby already has that kind of performance increase on microbenchmarks.

The CRuby JIT for 3x3 was built by one person in their spare time.


You could AOT-compile most of most programs and JIT the rest. Or just JIT everything.


CRuby has had JIT for over 3 years now and I think JRuby must be over a decade?


Are those first-class options now? Or is it like Python, where you can kinda sorta sometimes use a JIT?


A first-class option in Ruby 2.6 onwards.
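(That's MJIT, opt-in via the `--jit` command-line flag.)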


Yeah, but not that impressive in capabilities, hence other projects.

https://developers.redhat.com/blog/2020/01/20/mir-a-lightwei...


While MIR could be an interesting alternative to Cranelift, it doesn't really address the challenges faced when trying to improve performance via JIT compilation within CRuby.


Ruby has taught me a lot and launched my career as a dev. I enjoyed meeting the people I have met through it. I have gotten steady, good work for years. I really cut my teeth with working with DSLs, working with mixins, domain boundaries, designing for developer happiness, etc.

... but there's a reason I switched over to using Elixir instead. And that is being able to work with OTP.


Excellent points in the article about "top" (booting off top-level workers in Ractors) versus "bottom" (parallelization at the edges of the code where not as much state needs to be shared) cases, with the top likely to take more time to widely manifest as more Ruby code becomes Ractor friendly.

I know Ractors are still quite new, but I'm curious to see how much luck people will have in migrating existing parallel code over to them.

Relying on a lot of shared state is going to be the major blocker. Just on a token reading of the docs [1], it looks like you should still be able to share quite a lot of state, as long as you're able to freeze it all into immutability. You could then use message passing around the edges where immutability is a non-starter.
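From that token reading of the docs, the "freeze it all" path would look something like this sketch:

  # Deep-frozen state becomes shareable across Ractors.
  TABLE = Ractor.make_shareable({ "en" => "hello", "fr" => "bonjour" })

  r = Ractor.new do
    lang = Ractor.receive
    TABLE[lang]      # reading a constant is fine once its value is shareable
  end
  r.send("fr")
  r.take             # => "bonjour"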

But this is probably easier said than done.

---

[1] https://docs.ruby-lang.org/en/3.0.0/Ractor.html


Some silly multi-threaded questions incoming.

I ran a quick benchmark; here is the code:

  require 'benchmark'
  require 'prime'
  Benchmark.bm do |x|

  x.report('single'){
    8.times do 
      10_000_000.times.each do |num|
        num.prime?
      end
    end
  }

  x.report('thread'){ 
    8.times.map do 
      Thread.new do 
        10_000_000.times.each do |num|
          num.prime?
        end
      end
    end.each(&:join)
  }

  x.report('parallel'){
    8.times.map do
      Ractor.new do 
        10_000_000.times.each do |num|
          num.prime?
        end
      end
    end.each(&:take)
  }
  end
Results:

single -> 121.66s

thread -> 122.14s

parallel -> 52.84s

The single and threaded versions were the same, and this is mostly expected, since we don't have any I/O in the threads which would have given up the thread and let another be scheduled.

For prime calculations we are bound by pure CPU and limited by Ruby's global interpreter lock. I assume this would be the same case in Python.

1) A silly question, and mostly language-independent: when looking at my CPU logical core usage during the single and fake threaded version I could see tiny spikes in all the cores. [0] No one core dominated the task - how does this work? Does the CPU know to share out single-threaded tasks, and do branch prediction etc. come into play here? How are modern CPUs achieving this at a lower level? In my head this task would just spike a single core to do all the work.

2) Running the 10_000_000 primes just once completes in 15.3s - shouldn't this be the same amount of time for 80_000_000 if the Ractor implementation was optimized across my 8 logical cores? Not a punch or anything at the guys working on the language - it's freaking awesome to see my CPU being maxed out with all 8 logical processors being used in Ruby! [1] But I just want to double-check my understanding.

[0] https://postimg.cc/rKjB9kQP [1] https://postimg.cc/WtvpWswN


> when looking at my CPU logical core usage during the single and fake threaded version I could see tiny spikes in all the cores

What do you mean by 'fake threaded'? Ruby threads were already fully concurrent OS threads. They're likely scheduled on different physical cores, so will spike your different cores each time they manage to get the GIL and run.

> if the Ractor implementation was optimized across my 8 logical cores?

There's synchronisation. Many VM services are global.


Ah okay, in that case that makes much more sense to me for the threaded version. I was meaning fake-threaded in the sense that only one thread is ever run at a time, but of course they are scheduled onto different OS threads like you say, which means they will run across my logical cores.

But for the single-threaded version I still saw spikes across all cores - is this some CPU magic, and if so what techniques are involved? Or am I just misreading my CPU usage like a doughnut?

> There's synchronisation. Many VM services are global.

What do you mean by this?

And I just realized I'm talking to Chris Seaton - thank you for your work on TruffleRuby, super awesome project.


> But for the single-threaded version I still saw spikes across all cores - is this some CPU magic, and if so what techniques are involved?

The kernel must be moving the Ruby process around if you're looking at the single-threaded case. That doesn't sound right, though - moving a process to a different core damages its cache. But Ruby doesn't influence that; it's down to the OS. Maybe the kernel has some subtle good reason to do it if you're running other services in the background?

> What do you mean by this?

Ractors acquire locks and use CAS operations to update the VM on what they're doing. They don't run entirely in parallel. GC is the big example. This benchmark shouldn't really need to use the GC... but maybe it does. Does your system malloc have a global lock too?


> Maybe the kernel has some subtle good reason to do it

I think OSes spread load across cores to prevent unequal heating across the chip.


Hmm, that'd be a disaster for the cache. I think that's extremely unlikely, but I'm not an OS expert.


If you want a process to stay put on a specific core, you need to ask the OS for it; however, this might still influence performance, as naturally there are other processes running there, and the OS cannot honour the request if everyone does the same.

However, not copying the process context across all cores is already an improvement.

I used this to good effect back in the Windows NT/2000 days, whose scheduler wasn't that smart about when to move processes across cores.


You are comparing something that happens on the scale of multiple seconds to something that happens on the scale of microseconds.


For 1): This isn't CPU magic but the scheduler in the OS. Even when running your program with only one thread, your OS will still schedule the 100s of threads that are running on your machine (from all applications you have open) onto the 8 cores on your machine. It constantly switches between these threads, and your Ruby thread won't be put on the same core each time.

Furthermore, OSes spread load across cores on purpose to avoid unequal heating inside the processor. You can 'pin' a process to a specific core but this is not recommended. (The advantage is maybe better caching, but the disadvantage is unequal heating.)


Do you have a reference discussing core scheduling specifically to manage heating?

There's a ton of different rationales for scheduling processes to cores, and lots of core-scheduling algorithms, but I wasn't aware of trying to keep all the cores the same temperature.

Modern CPUs can run each core at a different clock, and you get higher clock speeds when fewer cores are running, so depending on overall system load, condensing to fewer cores can result in better throughput. As well, individual core performance varies, so some schedulers will try to identify which cores can clock the highest and schedule tasks on those cores first.

Generally, there are benefits from keeping a process running on the same core, but there's all sorts of scenarios that end up with a process bouncing over many cores.

For example, if you have an active CPU-bound program and an interrupt gets serviced by the core it's running on, an otherwise idle core may steal the process, because the process is runnable but not running. Pinning it to a particular core might work better, but if you pin processes without a plan for all your processes, you can end up with much worse throughput. It's pretty easy to pin too many things to a single core, so that core is 100% busy while other cores are idle, even though there were enough processes to saturate all the cores. Of course, if all of those processes are working together on the same shared memory with lots of lock contention, you might have worse throughput because of cross-core synchronization.

You can spend a lot of time tweaking things, but most OSes have a pretty decent general purpose scheduler. Pinning works great for things where you take up the whole machine with a single process, and that process has good reasons to pin cpus. Ex: you're running Erlang which has its own scheduling, or you're running HAProxy and you've done the work to tightly bind each process to a specific NIC queue interrupt, so you want that process to run on the same core that handles that interrupt to avoid cross-core communication.


Interestingly, on my machine all 3 implementations took the same amount of real time (~30s), but the Ractor implementation took ~200s in system time. I'm guessing the Ractor switching overhead ate up whatever parallelization gain it got.


FYI, `.times` already returns an Enumerator, so you don't need to call `.each` after it.
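i.e.:

  10_000_000.times do |num|   # same iteration, minus the extra Enumerator hop
    num.prime?
  end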


I don't understand - actix-web uses async/await and is ultra fast: https://www.techempower.com/benchmarks/

Does this have to do with the Tokio runtime more efficiently coordinating the asynchronous code?

Automatically translated.


I'd expect any Python codebase which tries to use async/await-based cooperative multitasking for concurrency at a larger scale to inevitably run into long-tail latency walls - not just once, but repeatedly as the codebase grows/changes.

The potential for a code change or addition to produce unexpected synchronous blocking (potentially with the GIL held) _somewhere_ seems really high, and I have to imagine it would be very hard to avoid in the context of a real application codebase...


You would have the same problem in Rust/Tokio and many other languages. The compiler is not able to know which calls could block. Suppose you have this async state-machine server on a single thread, super fast, but somewhere it opens a file synchronously - let's say on a network mount, to make things worse. No way a purely static, VM-free language can help against that. Go (see: goroutines) is able to automatically suspend at those points and continue execution elsewhere. It comes with a big runtime, but you get what you pay for.


Yes, but there’s nothing akin to the GIL in Rust/Tokio, so the potential impact of blocking is lower, and you can use spawn_blocking to move any blocking code (that you cannot eliminate) off the main worker threads that are polling futures, or use other techniques like run the blocking work on another standard thread and await on a channel for the result.


What's Go going to do if some ffi library written in C does a synchronous syscall? It's in the same situation.


Go discourages C FFI but it runs those in kernel threads just to be safe. If you are using a lot of C FFI calls that all block, then it could be a problem (all cores are blocked).


I would take this benchmark with a huge grain of salt: https://64.github.io/actix/#blazingly-fast-or-not


I think the area is a lot more complex than the relative latency and throughput of synchronous vs async. The observations feel true for the Ruby and Python environments they were made in, but swap different technologies into different sections of the processing, and I suspect different generalizations would have to be made.


Someone please correct me if I'm wrong, but Rust async/await isn't implemented with green threads, but with OS threads. Those can be pre-empted.


Async/await does not use threads of any kind. It creates a state machine.

Executors may execute those state machines on a single thread, or map them to many threads. The latter looks kinda like green threads depending on what your definition of “green thread” is.


It's single-threaded and completely userspace. Rust async/await is a fancy way to write poll/epoll. It can limit the amount of time your program is blocked on IO (insofar as the implementation is as you would expect - not guaranteed by the language), but it will not do parallelism. For example, you wouldn't async/await a prime-number calculation - you'd have to use OS threads.


It’s not necessarily single threaded; Rust async runtimes typically schedule tasks on a pool of multiple OS threads, usually one thread per CPU core. In other words, it can be seen as a form of M:N threading.

Regarding the parent’s question, tasks cannot be preempted from their threads. Threads can of course be preempted from their CPU cores by the OS, but if the number of threads equals the number of cores, this will only happen if other processes on the system are competing for CPU time.


nothing about how it works is built into the language; it's up to user code/libraries to decide how to do it.

The popular libraries like Tokio are using something akin to green threads, though people picky about definitions may not agree with that nomenclature.


> nothing about how it works is built into the language

I think the state-machine transformation required for yield, async & await is baked into the language.

But yes, one can implement the Future trait to take finer control.



