Using multiple cores, the basics

The problem with multicore code

Having an application use more than one CPU during its execution introduces an explosion of execution paths. Instead of a simple “first Java executes the first line of this method, then, it executes the second line of this method, all the way to the end,” it becomes: “An arbitrarily chosen thread chooses an arbitrary number of instructions to execute, then the VM will make an arbitrary choice as to whether or not to make any changes to fields this thread performed visible to other threads unless you’ve done a good job on reading the Java memory model to ensure propagation of these changes.”
The former is easily understood. The latter is (largely) untestable and unfollowable, and therefore, it’s easy to write code that looks good, passes all tests, runs great on your machine… and bombs in production five days later, in a way that is utterly mystifying; no exception pointing at the problem. The problem won’t be reproducible (do the same thing, click the same buttons, feed in the same data… and now it works fine. And then tomorrow when that important customer logs in, it’ll fail again on you). Finding a bug like this is literally on the order of 100 to 1000 times harder to find, so avoiding even one such bug is ‘worth’ having 100 others.
Nevertheless, using all the cores in your CPU is:

Practical, in that just running it all on one can be unacceptably wasteful for performance, but more importantly
Inevitable, because you don’t write every line of code in your app yourself (you use libraries; the core java.* libraries at the very least), and THOSE will sometimes use threads.

So, what is a programmer to do?
You may feel like the only right answer is for the programmer to completely understand threading and learn how to write bug free code (given that tests can’t really find these bugs anyway, and even if you so happen to run into one, they are hard to reproduce and its hard to figure out which line(s) of code are faulty when you do find them). This means reading up on the entire Thread API and reading the Java Memory Model front-to-back (If thread 1 writes some data to some field, and sometime later thread 2 reads this field, it may or may not see the change. The JMM defines when Java is, and more importantly is not, required to have one thread ‘see’ updates of another.
But that’s a fool’s errand. You cannot guarantee writing bug free code.
Instead, anytime you have a need to have your code run on multiple cores, try to push towards going big or going small; in both cases, you NEVER call ANY method on Thread (possibly you do some Thread.sleep in the ‘going big’ style), you have no need whatsoever of synchronized, no state is shared between threads, and you don’t create new threads (some framework / library does this for you).

Going Big

Write or use a framework that runs your code for you and which takes care of multithreading. For example, a web framework is such a thing: You start it, and then it runs your code (and it takes care of creating whatever threads are necessary). Generally in these cases you do not write public static void main, or you do, but all that your main does is fire up the framework while you configure it or pass to it a bunch of ‘handlers’, and then the framework does the hard work for you.
Such frameworks are really good at setting up threading. This way each handler can be in its own little world; if a thread shares no data with any other, reasoning about threads is much, much simpler. Take webservers: Any given web handler simply does not get to interact with other handlers. It’s not like you can ask the web framework for a list of WebHandler objects that you can then call methods on.
There is often still a need to have 2 handlers interact. However, you should do this interaction by using tightly controlled communication channels where the issue of how concurrent operations interact is either irrelevant, or very well specified. There are 2 usual ways to go:

Databases. Databases define how concurrently running operations go via transactions. So, use transactions, make your database calls, and the database takes care of it. Communications-via-database is very common in web frameworks. You can use JDBI if you want to write SQL directly, or use Hibernate if you just want to store objects persistently.
Message queues, such as RabbitMQ. The idea behind these is that one handler will tell the message queue framework that it is interesting in all ‘foo’ events, and the other will tell the message queue framework: Here’s a ‘foo’ event, please deliver it to all handlers that said they’d like to know about it. All communication between handlers goes via the message queue.

Editor’s note: the projects mentioned here are very far from being your only choices. JDBI and Hibernate are good, but note that there are projects like myBatis, jOOQ, and others for SQL-ish libraries, and Hibernate is only one JPA implementation among many. Likewise for messaging: RabbitMQ is an AMQP 0.9 server, but you don’t use the JMS standard to talk to RabbitMQ; you do use JMS when using ActiveMQ, HornetQ, or any other of … quite a few messaging platforms. This is not saying that RabbitMQ or Hibernate, et al, are bad – your Editor uses them daily – but remember that your choices are not limited.

Going Small

Reduce the code that needs to run multithreaded to the smallest possible thing it can be, and then write a single stream/collection based operation that uses threading to apply this operation to a great many inputs concurrently. The fork/join framework is the usual go-to here.
Imagine you are writing a bitcoin miner. This operation can be described as follows:
GIVEN: A list of a few million randomly generated codes to try, and exactly 1 input block.
TASK: Write the hash into that 1 input block, hash it, and see if the hash ends up having the appropriate amount of 0s at the very end.
You can do this by going small: Write a trivial method (it won’t be larger than half a screen’s worth, common in this ‘go small’ model) which injects the code into the block, hashes it, and returns an empty string if the hash is not suitable, and if you hit the jackpot, returns the block. Then tell the framework you only want the non-empty returned values and voila, you’ve written a parallel bitcoin miner without ever having to touch java.lang.Thread.

Libraries

In practice, especially for the ‘going big’ route, you may need to have a cache or some such that should be shared between whatever threads your web framework is making for you, but try to find libraries for this, too.¹ There are some tricks to interacting with such libraries. Generally, you have to go ‘atomic’. For example, you can use the various collections in the java.util.concurrent package, but, the only guarantees it can make is that a single method call does the right thing. It cannot guarantee that a series of calls does the right thing. So, don’t do this:

if (concurrentMap.containsKey(myKey)) {
    String v = concurrentMap.get(myKey);
    operateOn(v);
}

Instead, do this:

String v = concurrentMap.get(myKey);
if (v != null) {
    operateOn(v);
}

The bad example can cause a problem: What if, in between the call concurrentMap.containsKey and the call to concurrentMap.get, some other thread removes the entry from the map? You’ll now call operateOn on a null value, not what was intended. Problematically, if this can happen, it will happen, but only very very rarely. Tests are unlikely to catch this problem.
Don’t do this:

String v = myCache.get(myKey);
if (v == null) {
    v = doExpensiveCalculationOfValue(myKey);
    myCache.put(myKey, v);
}

Instead, do this:

myCache.computeIfAbsent(myKey, k -> doExpensiveCalculationOfValue(k));

The reason to use computeIfAbsent here is because the first snippet can lead to running doExpensiveCalculationOfValue more than once. In the first snippet, imagine two (or more) threads get to the code with the same key about the same time. They’ll both find that there is no associated value (yet), so they both call doExpensibeCalculationOfValue and then they both set it. With computeIfAbsent, provided you use a properly concurrent map, such as java.util.concurrent.ConcurrentHashMap for example, doExpensiveCalculationOfValue is only ever called once, guaranteed.

Lessons

Use frameworks, either large: Web frameworks, or small: fork/join.
Thread-to-thread communication uses abstractions like message queues or databases. Don’t share state, don’t touch any fields from more than one thread.
If you must have direct thread-to-thread interop, use libraries with collections types that are designed for this, such as java.util.concurrent and guava‘s CacheBuilder stuff. When interacting with these collections, know that consecutive method calls to them have no guarantees of internal consistency, so reduce it to one call. Be aware of methods like java.util.Map‘s computeIfAbsent to enable this.

Footnotes

1 Editor’s note: caches are nontrivial in and of themselves. Go for “transactional caches” if you can – or just trade the speed that a cache would give you for the reliability that using a system of record gives you. Figure out what you need and do what fulfills that, before thinking that a cache is magic performance sauce – and note that the whole point of this article is that “magic performance sauce” usually isn’t what it says it is.