Some further tests have shown that everything is fine as long as I start only 16 clients. My CPU has 16 cores and 32 threads. After starting 2 or 3 more clients, CPU usage goes up to 100%.
So it seems that if you start more clients than you have physical CPU cores, a lot of time is spent in context switches. Kernel (system) time is quite low, around 10% to 15%, so the overhead does not seem to be on the Windows side; it may be more on the Java side.
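
In case it helps anyone reproduce this, here is a minimal sketch of how the number of clients could be capped at the physical core count. Note that `Runtime.getRuntime().availableProcessors()` reports logical processors (it should be 32 on my machine), and as far as I know the JDK has no standard call for the physical core count, so dividing by 2 is an assumption of two hardware threads per core. `ClientCap` and `maxClients` are made-up names for illustration, not part of any library.

```java
public class ClientCap {

    // Hypothetical helper: estimates how many clients to start so that
    // each one can have its own physical core.
    // threadsPerCore is an ASSUMPTION supplied by the caller (2 for
    // typical SMT/Hyper-Threading CPUs); Java does not report it.
    static int maxClients(int logicalProcessors, int threadsPerCore) {
        if (logicalProcessors < 1 || threadsPerCore < 1) {
            throw new IllegalArgumentException("arguments must be >= 1");
        }
        // Integer division, but never go below one client.
        return Math.max(1, logicalProcessors / threadsPerCore);
    }

    public static void main(String[] args) {
        // Logical processors visible to the JVM (cores x hardware threads).
        int logical = Runtime.getRuntime().availableProcessors();

        // Assumed: 2 hardware threads per physical core.
        int cap = maxClients(logical, 2);

        System.out.println("Logical processors: " + logical);
        System.out.println("Suggested client cap: " + cap);
    }
}
```

With 32 logical processors and the assumed 2 threads per core, this gives a cap of 16, which matches the point where my tests still ran fine.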