Interesting Links, 14 April 2016

  • Natty is a natural language date parser written in Java. The idea is that you feed it strings like "1984/04/02", "february twenty-eighth", or "3 days from now", and get back a list of potential matching Date objects. It is not designed to pull dates out of natural language – for that you’d want something like OpenNLP – but it might be able to help convert the natural language dates you get from OpenNLP into Java’s Date representations.
  • Scriptus is a Maven plugin that writes the Git version into build properties. It’s not entirely well-documented, but that’s what open source might be able to fix, right?
  • In When Interviews Fail, Ted Neward deconstructs a DZone article (“Can You Call Non-Static Method From a Static?”). It’s not difficult to imagine the activity – after all, this site does it to authors all the time! – but Ted’s especially good at it. In this case, he’s actually trying to dig at the purpose of an interview in the first place – and closes with “what do you really interview for?”, because if you’re interviewing for some grunt who can answer the corner cases, that’s… all you’re going to get.
  • The jOOQ blog asks: “Would We Still Criticise Checked Exceptions, If Java had a Better try-catch Syntax?” History says that yes, people would criticize Java for pretty much anything they can think of, and a few things they can’t. But in this case, it’s talking about potential syntax where the try is optional. It’s not present in a real compiler, and it does have some syntactic clarity to it – but in this author’s opinion, it’s actually hiding some pretty important information (namely, that you’re entering a try/catch block).
  • From DZone: Properly Shutting Down An ExecutorService shows us a Spring bean to manage an ExecutorService shutdown. This, in itself, is a good thing. However, the interesting thing is that the author wrote this because Tomcat was failing to kill the ExecutorService itself – so the article basically illustrates why doing thread management in a web application is a bad idea. Let the container manage the threads, people. (This has always been in the specification – the apps are not supposed to start threads. Ever. Use message-driven beans, or timers, or a ManagedExecutorService.)
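
For readers who do end up owning an ExecutorService outside a container, here is a minimal sketch of the usual shutdown sequence in plain Java. It is not the Spring bean from the DZone article; the class and method names (GracefulShutdown, shutdownGracefully) are made up for illustration, and the timeout is an arbitrary assumption.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class GracefulShutdown {
    /**
     * Hypothetical helper: stop accepting new tasks, wait for submitted ones,
     * then interrupt whatever is still running.
     * Returns true if the pool terminated within the allotted time.
     */
    static boolean shutdownGracefully(ExecutorService pool, long timeout, TimeUnit unit) {
        pool.shutdown(); // no new tasks; already-submitted tasks keep running
        try {
            if (pool.awaitTermination(timeout, unit)) {
                return true;
            }
            pool.shutdownNow(); // interrupt the stragglers
            return pool.awaitTermination(timeout, unit);
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return false;
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        pool.submit(() -> System.out.println("working"));
        System.out.println("terminated: " + shutdownGracefully(pool, 5, TimeUnit.SECONDS));
    }
}
```

Inside a Java EE container, the advice above still applies: a ManagedExecutorService makes this kind of helper unnecessary.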

BTW, feel free to send me Java-related (or somewhat Java-related!) links you think are worth retaining!

Byte Order Marks (BOM)

The so-called Byte Order Mark is a special Unicode character (U+FEFF) that has no visual representation. The point of it is to start your text data with this pseudo-character; it serves as a way to identify the encoding, including its “endianness”: UTF-16 (Little Endian), UTF-16 (Big Endian), or UTF-8.
Java handles it kinda weirdly; this post describes how it works.
The BOM is the code point U+FEFF, which is the bytes 0xFE 0xFF when written big-endian. That means:

Encoding               First bytes in the stream
UTF-16, Little Endian  FF FE
UTF-16, Big Endian     FE FF
UTF-8                  EF BB BF

You can use these to identify streams.
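
As a rough sketch of what that identification could look like, the helper below checks the first bytes of a buffer against the UTF-16 signatures listed above, plus the UTF-8 signature EF BB BF. The names BomSniffer and detect are hypothetical, and the sketch deliberately ignores UTF-32 (whose little-endian BOM also starts with FF FE).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

class BomSniffer {
    /** Returns the charset implied by a leading BOM, or empty if there is none. */
    static Optional<Charset> detect(byte[] data) {
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return Optional.of(StandardCharsets.UTF_8);
        if (startsWith(data, 0xFF, 0xFE)) return Optional.of(StandardCharsets.UTF_16LE);
        if (startsWith(data, 0xFE, 0xFF)) return Optional.of(StandardCharsets.UTF_16BE);
        return Optional.empty();
    }

    /** Compares the leading bytes (treated as unsigned) against the prefix. */
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] sample = {(byte) 0xFF, (byte) 0xFE, 0x41, 0x00};
        System.out.println(detect(sample)); // Optional[UTF-16LE]
    }
}
```

A missing BOM tells you nothing either way, which is why the helper returns an empty Optional instead of guessing a default.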
In Java, the BOM is left in the stream data. So if, for example, you have a UTF-8 text file that starts with a BOM and you read it into a String, your string starts with the BOM character. It doesn’t show up when you print it, but it still ‘counts’, in the sense that the .length() call on your string counts the BOM as 1 character, and a string that starts with a BOM is not equal to one that doesn’t start with it, even if they are otherwise the same. You probably want to filter it out!
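
A small sketch of the effect and of one way to filter the BOM out follows. The class name BomStripper and the stripBom helper are illustrative assumptions, not a standard API.

```java
import java.nio.charset.StandardCharsets;

class BomStripper {
    static final char BOM = '\uFEFF';

    /** Hypothetical helper: drops a single leading BOM character, if present. */
    static String stripBom(String s) {
        return (!s.isEmpty() && s.charAt(0) == BOM) ? s.substring(1) : s;
    }

    public static void main(String[] args) {
        // UTF-8 BOM (EF BB BF) followed by the text "hi"
        byte[] bytes = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        String raw = new String(bytes, StandardCharsets.UTF_8);

        System.out.println(raw.length());               // 3: the BOM counts as one char
        System.out.println(raw.equals("hi"));           // false
        System.out.println(stripBom(raw).equals("hi")); // true
    }
}
```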
The only exception is the special encoding UTF-16. If a BOM is present, this encoding consumes it and uses it to configure itself as Little Endian or Big Endian. If there is no BOM, it defaults to Big Endian. Note that this ‘consume the BOM’ behaviour does not apply to the encodings UTF-16LE and UTF-16BE. They decode the BOM like any other character and leave it in the result.
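
The difference is easy to see by decoding the same bytes with both charsets. This is only a sketch; the class name Utf16BomDemo and the codeUnits helper are made up for the example.

```java
import java.nio.charset.StandardCharsets;

class Utf16BomDemo {
    /** Renders each UTF-16 code unit of the string as four hex digits. */
    static String codeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xFF, (byte) 0xFE, 0x41, 0x00}; // BOM + "A", little endian
        byte[] noBom = {0x00, 0x41};                             // "A", big endian, no BOM

        // UTF-16 consumes the BOM and switches to little endian
        System.out.println(codeUnits(new String(withBom, StandardCharsets.UTF_16)));   // 0041
        // UTF-16LE keeps the BOM as an ordinary character
        System.out.println(codeUnits(new String(withBom, StandardCharsets.UTF_16LE))); // FEFF 0041
        // Without a BOM, UTF-16 assumes big endian
        System.out.println(codeUnits(new String(noBom, StandardCharsets.UTF_16)));     // 0041
    }
}
```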
NB: Esoteric note: The Unicode code point U+FFFE (the BOM with its bytes swapped) is intentionally defined as not a valid character, so that the BOM can be used unambiguously as indicating the endianness of a UTF-16 stream. However, in Java, reading this special invalid character does not throw an exception. You can therefore take the byte data FF FE 41 00, which is the string "A" encoded with a BOM as UTF-16 Little Endian, and read it using the encoding UTF-16BE. This produces garbage (the bytes 41 00 are read as U+4100 rather than "A"), but does not throw an exception.
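
Here is a minimal sketch of that mismatch, using the String constructor (which replaces rather than rejects input it dislikes). The class and method names are hypothetical. The first decoded character is deliberately not printed, on the assumption that it may depend on how a given JDK's decoder treats U+FFFE.

```java
import java.nio.charset.StandardCharsets;

class MismatchedEndianDemo {
    /** Decodes the bytes as UTF-16BE, regardless of what they really are. */
    static String decodeAsBigEndian(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        // "A" with a BOM, encoded as UTF-16 Little Endian
        byte[] littleEndianA = {(byte) 0xFF, (byte) 0xFE, 0x41, 0x00};
        String garbage = decodeAsBigEndian(littleEndianA); // no exception

        System.out.println(garbage.length());                       // 2
        System.out.println(Integer.toHexString(garbage.charAt(1))); // 4100
        System.out.println(garbage.contains("A"));                  // false
    }
}
```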