
Wednesday, May 28, 2008

Highly concurrent data structures in Java

Parallelizable / parallel programming: multicore, multithreaded, multiprocessor.
Slashdot | Scalable Nonblocking Data Structures Programming
"InfoQ has an interesting writeup of Dr. Cliff Click's work on developing highly concurrent data structures for use on the Azul hardware (which is in production with 768 cores), supporting 700+ hardware threads in Java. The basic idea is to use a new coding style that involves a large array to hold the data (allowing scalable parallel access), atomic update on those array words, and a finite-state machine built from the atomic update and logically replicated per array word. The end result is a coding style that has allowed Click to build 2.5 lock-free data structures that also scale remarkably well."
Scalable Nonblocking Data Structures
These operations may need no OS system call and may use no explicit semaphore or lock, but the memory bus has to be locked briefly -- in particular, to guarantee that all CPUs see the same updated value, it has to do a write-through and cannot just update the values in the cache local to the CPU. And when you have a large number of CPU cores running, the memory bus itself becomes the bottleneck.

That's not strictly true.

First, most lock operations do not require a full bus lock. All you have to do is ensure atomicity of the load and store, which effectively means you have to 1) acquire the cache line in the Modified state (you're the only one who has it), and 2) prevent system probes from invalidating the line before you can write to it, by NACKing those probes until the LOCK is done. In practice this means the locked op has to be the oldest on that CPU before it can start, which ultimately delays its retirement, but not by as much as a full bus lock would. It also has minimal effect on the memory system: the LOCK does not fundamentally add any additional traffic.
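
In Java terms, the "locked op" described here is what a single compareAndSet boils down to: one atomic read-modify-write on one cache line, with no mutex and no full bus lock. A hedged sketch of the usual retry loop (my own illustration, not from the comment):

    import java.util.concurrent.atomic.AtomicLong;

    public final class CasIncrement {
        private final AtomicLong counter = new AtomicLong();

        // Lock-free increment: each attempt is one atomic op on the counter's word.
        public long increment() {
            while (true) {
                long current = counter.get();   // ordinary load of the word
                long next = current + 1;
                // On x86 this CAS is a single LOCK CMPXCHG: the cache line is
                // acquired in the Modified state and probes are deferred until
                // the instruction retires -- no bus-wide lock is taken.
                if (counter.compareAndSet(current, next)) {
                    return next;
                }
                // CAS failed: another CPU changed the word first; just retry.
            }
        }
    }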

Second, the way the value is propagated to other CPUs is the same as for any other store. When the cache line is in the Modified state, only one CPU can have a copy. All other CPUs that want it will send probes, and the CPU with the M copy will send its data to the CPUs requesting it, either invalidating its own copy or downgrading it to Shared, depending on the type of request, the coherence protocol, etc. If nobody else wants it and it is eventually evicted from the CPU cache, it will be written back to memory. This is the same, LOCK or no LOCK.

Third, an explicit mutex requires at least two separate memory requests, possibly three: one to acquire the lock and another to modify the protected data. This results in two cache misses for the other CPUs, one for the mutex and one for the data, both of which will be in the Modified state and thus present only in the cache of the CPU that last held the mutex. In some consistency models (x86 not being one of them), releasing the mutex also requires a final memory barrier to ensure all writes done inside the lock are visible.
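
The same contrast in Java: the mutex version touches two separate shared locations (the lock's state word and the data), while the atomic version touches only the data word. A hedged sketch, not taken from the original discussion:

    import java.util.concurrent.atomic.AtomicLong;
    import java.util.concurrent.locks.ReentrantLock;

    public final class LockVsAtomicCounter {
        // Mutex version: other CPUs will miss on both the lock's state word and
        // 'value', since each ends up Modified in the cache of whichever CPU
        // last held the lock.
        private final ReentrantLock lock = new ReentrantLock();
        private long value;

        public long incrementWithLock() {
            lock.lock();        // memory request #1: the lock word
            try {
                return ++value; // memory request #2: the protected data
            } finally {
                lock.unlock();  // and a release write back to the lock word
            }
        }

        // Atomic version: the data word is the only shared location touched.
        private final AtomicLong atomicValue = new AtomicLong();

        public long incrementAtomically() {
            return atomicValue.incrementAndGet(); // one read-modify-write on one word
        }
    }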

Fourth, with fine enough granularity, most mutexes are uncontended. This means the overhead of locking the mutex is really just that: overhead. Getting maximal granularity/concurrency with mutexes would mean having a separate mutex for every element of your data array, which is wasteful of memory and bandwidth. Building the atomicity assumptions into the structure itself means you use the minimal amount of memory (and thus memory bandwidth) and get the maximal amount of concurrency.
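
A small sketch of that granularity argument (again illustrative, using standard JDK classes rather than the Azul code): per-element locking needs an extra lock object per slot, while an atomic array bakes the atomicity into the data words themselves.

    import java.util.concurrent.atomic.AtomicLongArray;

    public final class PerSlotCounters {
        // "Maximal granularity" with locks: one monitor object per element,
        // which roughly doubles the memory (and memory bandwidth) touched.
        private final Object[] locks;
        private final long[] lockedCounts;

        // Same per-element concurrency with no extra state: atomic updates
        // directly on the array words.
        private final AtomicLongArray atomicCounts;

        public PerSlotCounters(int n) {
            locks = new Object[n];
            lockedCounts = new long[n];
            for (int i = 0; i < n; i++) {
                locks[i] = new Object();
            }
            atomicCounts = new AtomicLongArray(n);
        }

        public void incrementLocked(int i) {
            synchronized (locks[i]) { // extra object plus extra cache traffic per slot
                lockedCounts[i]++;
            }
        }

        public void incrementAtomic(int i) {
            atomicCounts.incrementAndGet(i); // a single atomic op on the slot's own word
        }
    }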

So basically, while it isn't necessarily "radical" (practical improvements often aren't), it is definitely more than bogus marketing. There's a lot more to it than that.
