Java OutOfMemoryError : GC overhead limit exceeded Example

This article continues the series explaining out-of-memory errors. In this post I show how to reproduce Java OutOfMemoryError : GC overhead limit exceeded. We will see how to recreate it and what impact it has in monitoring tools.

I used only JVM 1.8 x64 on a Windows 7 x64 laptop with 8 GB RAM and a 2.5 GHz Core i5.

Tools : 
IDE : Eclipse
Profiling/Monitoring tool :
1. VisualVM
2. JConsole (optional)
3. YourKit (optional)

I am using some JVM flags to get detailed GC information and to enable monitoring via JMX. Please see step 1 of this post for details.
These are the flags I am using here (-Xmx limits the heap so that the error comes quickly):
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
-Dcom.sun.management.jmxremote=true
-Dcom.sun.management.jmxremote.port=3000
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false
-Djava.rmi.server.hostname=localhost
-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=D:\OOM
-Xmx100m

Now, the GC overhead error is easy to understand: the GC cannot free up the heap despite its best efforts. The main cause is that the JVM is spending more than 98% of its time in garbage collection while recovering less than 2% of the heap. In our example, we will see the GC overhead error multiple times, because the heap stays full once a certain number of items have been added.
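To watch this ratio from inside the application, a minimal sketch like the following can help. It reads the standard GarbageCollectorMXBean counters and compares total GC time with JVM uptime. The class name GcShare is only my illustration, and the JVM's own overhead check works on recent collections rather than on the whole uptime, so treat the number as a rough indicator only.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper (not part of the original example): rough share of uptime spent in GC.
class GcShare {
    // Total milliseconds reported by all collectors so far.
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report a time
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    // Approximate share of JVM uptime spent in GC, clamped to the range 0..1.
    static double gcShare() {
        long uptime = ManagementFactory.getRuntimeMXBean().getUptime();
        if (uptime <= 0) {
            return 0.0;
        }
        return Math.min(1.0, (double) totalGcMillis() / uptime);
    }

    public static void main(String[] args) {
        System.out.printf("GC time: %d ms, share of uptime: %.2f%%%n",
                totalGcMillis(), gcShare() * 100);
    }
}
```

Calling GcShare.gcShare() periodically from the test loop should show the share climbing as the heap fills up.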

Again, to know about the error, you can visit this original post.

Scenario: 

A very simple scenario: I add a string to a map in an infinite loop (a map is costly per entry; with an ArrayList it would take more time to reach the error).

Code : 
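import java.util.HashMap;
import java.util.Map;

public class GCOverheadOOM {
    // The static map keeps every entry strongly reachable, so the GC can never reclaim them.
    private static final Map<Integer, String> aMap = new HashMap<Integer, String>();

    public static void createGCOverheadOOM() {
        int i = 0;
        try {
            while (true) {
                aMap.put(i, "Shantonu adding String");
                System.out.println("Total Items " + i++);
            }
        } catch (Throwable e) {
            // Catching Throwable lets us report the map size when the OutOfMemoryError arrives.
            System.err.println("\nError after adding " + aMap.size() + " items");
            e.printStackTrace();
        }
    }
}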

And from main method, call this.
GCOverheadOOM.createGCOverheadOOM();
Note : As this is a GC-related error (overhead), it depends entirely on the GC algorithm. This code generates the error with the default or parallel GCs. When I used a different collector, I got a slightly different error. I tried the following JVM flags to select the GC algorithm, one at a time:
1. -XX:+UseParallelGC -XX:-UseParallelOldGC
2. -XX:-UseParNewGC -XX:+UseConcMarkSweepGC
3. -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
4. -XX:+UseG1GC
5. -Xincgc
6. -XX:+UseSerialGC
7. -XX:+UseParallelGC

Oracle has clear indication here.
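
Since the result depends on the collector, it is worth confirming which collector the JVM actually picked for a given flag. The following is a small sketch using the standard GarbageCollectorMXBean API; the class name ActiveCollectors is only my illustration. On HotSpot with the parallel collector, the names are typically "PS Scavenge" and "PS MarkSweep", but I would treat the exact strings as JVM dependent.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists the names of the collectors active in this JVM.
class ActiveCollectors {
    static List<String> names() {
        List<String> result = new ArrayList<String>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            result.add(gc.getName());
        }
        return result;
    }

    public static void main(String[] args) {
        // Run once per GC flag and compare the printed names.
        System.out.println("Active collectors: " + names());
    }
}
```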

Error analysis in console  : 

The error occurred after 1488049 items were added. We can see multiple OOM messages, one for each attempt by the GC.
image

Dump analysis in VisualVM (created at OOM): I am using the dump from one of the errors

Summary : 
 
 

Top contributors : 
image

Visual VM Monitoring : 
GC before ending :
image

Heap :
image


JConsole Monitoring : (overall)
image


Yourkit Monitoring :

CPU usages :
image

Heap : 
image

Non Heap : 
image

Please comment if you have any question.

Thanks.. :)

Java OutOfMemoryError :Unable to create new native thread, JVM Exception/Error

This article continues the series explaining out-of-memory errors. In this post I show how to reproduce Java OutOfMemoryError : Unable to create new native thread. We will see how to recreate it and what impact it has in monitoring tools.

I used JVM 1.6 x64 and JVM 1.8 x64 on a Windows 7 x64 laptop with 8 GB RAM and a 2.5 GHz Core i5.

Tools : 
IDE : Eclipse
Profiling/Monitoring tool :
1. VisualVM
2. JConsole (optional)
3. YourKit (optional)

I am using some JVM flags to get detailed GC information and to enable monitoring via JMX. Please see step 1 of this post for details.
Now, the nature of the error is easy to understand: due to the PC's limitations, the JVM is unable to create a new thread. Again, to know about the error, you can visit this original post.

Scenario :
I will write a function that creates many threads, marks each one as a daemon thread, starts it (each thread sleeps in a loop, so it keeps running), and stores all the threads in a collection. This way the number of threads keeps growing until we get the error.

Code :
import java.util.ArrayList;
import java.util.List;

public class ThreadOOM {
    public static void createOOMbyThread() {
        // Holding every thread in a list keeps a reference to each one while the count grows.
        List<Thread> allThreads = new ArrayList<Thread>();
        try {
            while (true) {
                long startTime = System.currentTimeMillis(); // timing is optional
                Thread aThread = new Thread(new Runnable() { // anonymous Runnable class
                    @Override
                    public void run() {
                        // System.out.println("Thread is running");
                        try {
                            // Sleep in a loop so the thread stays alive until it is interrupted.
                            while (!Thread.interrupted()) {
                                Thread.sleep(1);
                            }
                        } catch (InterruptedException ignored) {
                            // Interrupted while sleeping: let the thread end.
                        }
                    }
                });
                aThread.setDaemon(true);
                aThread.setPriority(Thread.MIN_PRIORITY);
                allThreads.add(aThread);
                aThread.start();
                long endTime = System.currentTimeMillis(); // timing is optional
                Thread.sleep(1); // delay between threads; remove it to reach the error faster
                System.err.println("Total Thread = " + allThreads.size()
                        + " Required time(ms) = " + (endTime - startTime));
            }
        } catch (Throwable e) {
            System.err.println("\nError at " + allThreads.size() + " threads");
            e.printStackTrace();
        }
    }
}


And from main method , just call
ThreadOOM.createOOMbyThread();
Source in Github
Error analysis in console  : 
Java 6 : The error happened at 3007 threads.
image

BTW, this number is not always the same; it depends on CPU/memory load. When I attached the YourKit agent, the number was lower.

Java 8 : Amazingly, Java 8 went up to 200906 threads. To reduce the delay, I commented out the thread delay (the 2nd one) so that it moves faster. My PC became very slow and resource usage was very high. It seems Java 8 has a separate mechanism to provide very high performance.
image

Java 6 Monitoring in Visual VM : 
I attached VisualVM on the next run and got the error at 3015 threads. This is the console error.
image

As the JVM runs the user application alongside its own daemon threads, the count in VisualVM is a little higher, at 3030 (our created daemon threads + predefined daemon threads).
image
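
The same counts can be read from inside the JVM without a profiler. Here is a minimal sketch using the standard ThreadMXBean; the class name ThreadCounts is only my illustration. The live count includes the JVM's own threads, which is why it comes out a little higher than the number of threads the test created.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Hypothetical helper: reads thread counts the way a monitoring tool would, via JMX beans.
class ThreadCounts {
    // Returns {live threads, daemon threads, peak live threads}.
    static int[] snapshot() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        return new int[] {
            threads.getThreadCount(),
            threads.getDaemonThreadCount(),
            threads.getPeakThreadCount()
        };
    }

    public static void main(String[] args) {
        int[] s = snapshot();
        System.out.println("Live = " + s[0] + ", daemon = " + s[1] + ", peak = " + s[2]);
    }
}
```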

And if we look at the GC activity, we can see how much CPU it was taking.
image

You may have noticed 727%. The JVM counts a single thread as 100% CPU usage, so with multiple cores or hardware threads the percentage multiplies. A tweaked/boosted/overclocked CPU increases the percentage further. My CPU was a 2.5 GHz Core i5, boostable up to 3.1 GHz, and it was running below 2 GHz when the JVM started. So, overall, the figure multiplies single-thread CPU performance by the number of threads.
And heap usage:

image

PermGen usages
image


Java 6.0 Monitoring in YourKit : 
When I attached YourKit, the thread OOM happened at 2228 threads and the heap was this.
image

As YourKit has some initial delay when attaching to the application, it may show a lower number of threads. In my case it was 2185.
image

And this was the heap usage:
image


Java 8 Monitoring : 
VisualVM : When I used VisualVM, the OOM occurred after 189772 threads, but it also caused
1. a stack overflow error
2. a VisualVM crash
image

So I should say that it was possible to create the OOM with a profiler attached, but not possible to diagnose it properly. A similar story happened with YourKit.

These are exceptions from log (in case you are interested)
Internal exceptions (10 events):
Event: 4.782 Thread 0x0000000261194800 Exception <a 'java/lang/NoSuchMethodError': <clinit>> (0x00000000d797dd50) thrown at [C:\re\workspace\8-2-build-windows-amd64-cygwin\jdk8u51\3951\hotspot\src\share\vm\prims\jni.cpp, line 1598]
Event: 18.312 Thread 0x0000000261194800 Implicit null exception at 0x0000000002ba9985 to 0x0000000002ba9b7d
Event: 18.312 Thread 0x0000000261194800 Implicit null exception at 0x0000000002b86f40 to 0x0000000002b86fc1
Event: 20.038 Thread 0x000000057ff42000 Implicit null exception at 0x0000000002bfb3ab to 0x0000000002bff905
Event: 2377.189 Thread 0x0000000bc2404800 Exception <a 'java/net/SocketException': Connection reset by peer: socket write error> (0x00000000dc4f84a8) thrown at [C:\re\workspace\8-2-build-windows-amd64-cygwin\jdk8u51\3951\hotspot\src\share\vm\prims\jni.cpp, line 735]
Event: 2377.194 Thread 0x0000000bc2404800 Exception <a 'java/net/SocketException': Software caused connection abort: socket write error> (0x00000000dc4f8ab0) thrown at [C:\re\workspace\8-2-build-windows-amd64-cygwin\jdk8u51\3951\hotspot\src\share\vm\prims\jni.cpp, line 735]
Event: 2377.224 Thread 0x0000000bc2404800 Implicit null exception at 0x0000000002dd807f to 0x0000000002dd81f1
Event: 2377.229 Thread 0x0000000bc2404800 Exception <a 'java/net/SocketException': Software caused connection abort: socket write error> (0x00000000dc50fc10) thrown at [C:\re\workspace\8-2-build-windows-amd64-cygwin\jdk8u51\3951\hotspot\src\share\vm\prims\jni.cpp, line 735]
Event: 2377.232 Thread 0x0000000bc2404800 Exception <a 'java/net/SocketException': Software caused connection abort: socket write error> (0x00000000dc5102b8) thrown at [C:\re\workspace\8-2-build-windows-amd64-cygwin\jdk8u51\3951\hotspot\src\share\vm\prims\jni.cpp, line 735]
Event: 2377.233 Thread 0x0000000bc2404800 Exception <a 'java/security/PrivilegedActionException'> (0x00000000dc510468) thrown at [C:\re\workspace\8-2-build-windows-amd64-cygwin\jdk8u51\3951\hotspot\src\share\vm\prims\jvm.cpp, line 1382]

Yourkit :

I was trying to attach YourKit as the profiler, and I got the error after 225167 threads. I found this in my console.
image
image
From the JVM logs, I found this error: EXCEPTION_ACCESS_VIOLATION

It is easy to understand that this error occurred because YourKit was trying to access a JVM that was running under very high PC resource usage and could not load the JMX thread to connect with YourKit. These are the exceptions (last 10) from the error log, in case you are interested.

Internal exceptions (10 events):
Event: 0.221 Thread 0x00000000021e8000 Exception <a 'java/lang/ArrayIndexOutOfBoundsException'> (0x00000000d606de18) thrown at [C:\re\workspace\8-2-build-windows-amd64-cygwin\jdk8u51\3951\hotspot\src\share\vm\runtime\sharedRuntime.cpp, line 605]
Event: 0.221 Thread 0x00000000021e8000 Exception <a 'java/lang/ArrayIndexOutOfBoundsException'> (0x00000000d606f650) thrown at [C:\re\workspace\8-2-build-windows-amd64-cygwin\jdk8u51\3951\hotspot\src\share\vm\runtime\sharedRuntime.cpp, line 605]
Event: 0.229 Thread 0x00000000021e8000 Exception <a 'java/lang/ArrayIndexOutOfBoundsException'> (0x00000000d60956e0) thrown at [C:\re\workspace\8-2-build-windows-amd64-cygwin\jdk8u51\3951\hotspot\src\share\vm\runtime\sharedRuntime.cpp, line 605]
Event: 0.236 Thread 0x00000000021e8000 Exception <a 'java/lang/ArrayIndexOutOfBoundsException'> (0x00000000d60a4fe8) thrown at [C:\re\workspace\8-2-build-windows-amd64-cygwin\jdk8u51\3951\hotspot\src\share\vm\runtime\sharedRuntime.cpp, line 605]
Event: 0.273 Thread 0x00000000021e8000 Exception <a 'java/security/PrivilegedActionException'> (0x00000000d60f2058) thrown at [C:\re\workspace\8-2-build-windows-amd64-cygwin\jdk8u51\3951\hotspot\src\share\vm\prims\jvm.cpp, line 1382]
Event: 0.273 Thread 0x00000000021e8000 Exception <a 'java/security/PrivilegedActionException'> (0x00000000d60f2210) thrown at [C:\re\workspace\8-2-build-windows-amd64-cygwin\jdk8u51\3951\hotspot\src\share\vm\prims\jvm.cpp, line 1382]
Event: 0.275 Thread 0x00000000021e8000 Exception <a 'java/security/PrivilegedActionException'> (0x00000000d60f5d08) thrown at [C:\re\workspace\8-2-build-windows-amd64-cygwin\jdk8u51\3951\hotspot\src\share\vm\prims\jvm.cpp, line 1382]
Event: 0.275 Thread 0x00000000021e8000 Exception <a 'java/security/PrivilegedActionException'> (0x00000000d60f5ec0) thrown at [C:\re\workspace\8-2-build-windows-amd64-cygwin\jdk8u51\3951\hotspot\src\share\vm\prims\jvm.cpp, line 1382]
Event: 0.441 Thread 0x00000000021e8000 Exception <a 'java/lang/ClassNotFoundException': javax/management/remote/rmi/RMIServerImpl_Skel> (0x00000000d6152888) thrown at [C:\re\workspace\8-2-build-windows-amd64-cygwin\jdk8u51\3951\hotspot\src\share\vm\classfile\systemDictionary.cpp, line 210]
Event: 332.215 Thread 0x00000000021e8000 Exception <a 'java/lang/OutOfMemoryError': unable to create new native thread> (0x00000000d790d350) thrown at [C:\re\workspace\8-2-build-windows-amd64-cygwin\jdk8u51\3951\hotspot\src\share\vm\prims\jvm.cpp, line 3016]


If you have any question, please feel free to ask.
Thanks .. :)

Performance Metrics: Part 1 : Load Generator Metrics (Response time, Throughput, Errors, Bandwidth)

In this article, we are going to look at the popular performance metrics/graphs used in performance reports to indicate the performance of an application.
In the previous article, we learned about KPIs. A KPI is a summary of all metrics. So let's discuss performance metrics in detail.

As we know, while doing performance testing, we have several kinds of tools that provide performance measurements.
So, these are the different sources of performance metrics.

1. Load generating tools (like LoadRunner, JMeter, WAPT, Gatling etc)
2. Server/client monitoring tools (like perfmon, perf, Process Monitor etc)
3. Profiling tools (Java profilers, .NET profilers, debuggers, tracers etc)
4. APM tools (Dynatrace, New Relic, AppDynamics etc)

In addition to that, we need to provide some performance metrics manually, based on performance or business goals. To know more about performance goals, you may visit these posts
1. Performance test goals. 
2. What are Performance requirements?

In this first part, I will describe only the measurements that we get from load generating tools.

Graphs from Load Generator Tools:

For performance testing, we may use LoadRunner, JMeter, or any other tool; I am keeping this generic. Load generators create a scenario (different types of user load/stress/soak/spike) externally. Externally means the requests arrive the way they would from a client who uses the system. For example, if you are testing a web server, we consider traffic over a specific protocol. If you are testing a web service, it is a client sending requests to the web service over a protocol. Or, if you are testing a chat server, it is requests (raw/text/XML) over a protocol (UDP/WebSockets/XMPP). So, in all cases, this is external activity on the test system, and we can monitor/measure external behavior. For these kinds of tools, the following metrics can represent performance.

A. Response time: 

This is a key indicator for a test system. Response time refers to how responsive the application is to requests. We can get response time from any load generator tool. More often than not, it comes in different types

a. Maximum response time 
b. Minimum response time
c. Average response time
d. Mean response time
e. 90th percentile response time 
f. 95th percentile response time 
g. 99th percentile response time 

These are all statistical representations of the requests sent from the tool. To know the details of each, you may google it.
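
As a quick illustration of how these numbers relate, here is a minimal sketch that computes the minimum, maximum, average, and percentiles from a list of response times. The class name and the sample values are made up by me, and the percentile uses the nearest-rank method, which is only one of several definitions; load testing tools may calculate it slightly differently.

```java
import java.util.Arrays;

// Hypothetical helper: summary statistics over response times in milliseconds.
class ResponseTimeStats {
    // Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
    // The array must be non-empty and sorted in ascending order.
    static long percentile(long[] sortedMillis, double p) {
        int rank = (int) Math.ceil(p * sortedMillis.length / 100.0);
        return sortedMillis[Math.max(rank, 1) - 1];
    }

    static double average(long[] millis) {
        return Arrays.stream(millis).average().orElse(0.0);
    }

    public static void main(String[] args) {
        // Made-up samples: most requests are fast, one is a slow outlier.
        long[] samples = {120, 80, 95, 300, 110, 105, 90, 2500, 100, 115};
        long[] sorted = samples.clone();
        Arrays.sort(sorted);

        System.out.println("Min  = " + sorted[0]);                 // 80
        System.out.println("Max  = " + sorted[sorted.length - 1]); // 2500
        System.out.println("Avg  = " + average(sorted));           // 361.5
        System.out.println("90th = " + percentile(sorted, 90));    // 300
        System.out.println("95th = " + percentile(sorted, 95));    // 2500
    }
}
```

Notice how the single 2500 ms outlier pulls the average up to 361.5 ms even though most requests finish in about 100 ms. This is why the percentiles are usually reported beside the average.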

Where to use this? 

1. When we need to present performance data to business users so that they can communicate improvements to customers.

2. When we need to show performance to a client/user.

3. When the Dev/technical team needs to investigate the response time distribution.

How to use with graph? 

1. Response time vs time graph (X axis time, Y axis response time): This shows the response time for overall/particular steps as time passes.
>For business, this means that over time the application stays usable and performs as stably as at the beginning. It shows consistent behavior as time passes.

2. Response time vs user/thread graph (X axis user/thread, Y axis response time): This shows the response time for overall/particular steps as the number of users/threads increases or decreases.
>For business, this means the application stays usable and performs stably as the number of users increases or decreases. It shows consistency across usage levels.

Best practices: 

1. To measure the maximum delay, use the max response time.

2. To benchmark or understand the server, use the 90th percentile response time (the 95th & 99th percentiles support the 90th and are often shown beside it).

3. To check the SLA response time, the avg. response time is used.

Note :
1. Response time is usually measured in milliseconds. We might need to convert it into seconds.
2. For detailed visibility, sometimes we need to provide latency separately.


B. Throughput : 

Throughput refers to the capacity of an individual system/server, so it is the main indication of server/system performance. It is a very important performance indicator for capacity planning or for achieving a target SLA. Throughput is usually represented in two ways.

a. Hits per second (or min/hour) : Hits per second (or any unit) refers to the server's capacity to serve requests. The basic unit counted is a hit. It is calculated as (total hits or requests to the server) / (time taken to send the requests).

b. Requests/Transactions per second (or min/hour) : Requests per second (or any unit) refers to how many requests are served by the server. Usually tools treat each request as one transaction, so it is also called transactions per second. In all load testing tools, we can also group multiple requests into a single transaction and measure that transaction as a unit (as a whole) to calculate this. In this way, we can represent a business transaction in a single measurement. Example : For log in, we go to the log in page, insert credentials, and send the log in request. By default, these three requests are considered three transactions, but we can group them together and represent them as a single transaction.

Nowadays, in LoadRunner/JMeter, hits per second and throughput are shown in separate graphs.
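
To make the difference between hits and business transactions concrete, here is a small worked sketch using the log in example above. The class name and the numbers are only my illustration: 600 log ins in 60 seconds, each made of 3 requests.

```java
// Hypothetical helper: throughput as a count divided by the elapsed time.
class Throughput {
    // Returns events per second for a count observed over elapsedMillis.
    static double perSecond(long count, long elapsedMillis) {
        if (elapsedMillis <= 0) {
            throw new IllegalArgumentException("elapsedMillis must be positive");
        }
        return count * 1000.0 / elapsedMillis;
    }

    public static void main(String[] args) {
        long logins = 600;          // made-up number of business transactions
        long requestsPerLogin = 3;  // open page, insert credentials, send log in request
        long elapsedMillis = 60_000;

        double hitsPerSec = perSecond(logins * requestsPerLogin, elapsedMillis);
        double txPerSec = perSecond(logins, elapsedMillis);

        System.out.println("Hits/sec         = " + hitsPerSec); // 30.0
        System.out.println("Transactions/sec = " + txPerSec);   // 10.0
    }
}
```

The same run reports 30 hits per second but only 10 business transactions per second, so it matters which of the two a report or an SLA is talking about.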

Where to use this? 

1. It is mostly used by the technical or support team to measure performance as capacity.

2. Business users may use this only in business transaction format (how many business transactions per second; for example, the number of debit/credit transactions may represent the strength of banking software).

3. This is a must during server migration. It is the key indicator in a benchmark.

4. This is a key indicator for scalability testing, especially scale up & scale down tests.

5. This is also used for testing a load balancer & its distribution.

How to use with graph? 

1. Throughput hits/sec over time : (X axis time, Y axis hits/sec) : This represents hits/sec as time passes. For the unit, you can also use per min or per hour. Usually SLAs are defined per hour.

2. Throughput transactions/sec over time : (X axis time, Y axis tr/sec) : This represents transactions/sec as time passes. Usually this is a business transaction or a logical transaction based on application requests. For business transactions, SLA definitions are measured per hour.

3. Throughput hits/sec over user/thread increment or decrement : (X axis user/thread, Y axis hits/sec) : This represents hits/sec as the number of users/threads grows. It represents scalability. More often, it is used to check the server's scale up & scale down capability.

4. Throughput transactions/sec over user/thread increment or decrement : (X axis user/thread, Y axis tr/sec) : This represents transactions/sec as the number of users/threads grows (same as the previous one). It represents scalability. More often, it is used to check the server's scale up & scale down capability.

5. Byte throughput over time : (X axis time, Y axis bytes/sec) : This is a simplified representation of transactions/sec over time, where each transaction is represented in bytes (the size of the transaction). This graph shows the throughput of the server in bytes at a glance. It is very useful for both technical and business users to understand throughput in a simple manner.

Best practice: To understand quickly, these are widely used.
1. HPS(hit/s) over time
2. Tr/min over time (business transaction)
3. Byte throughput over time.


C. Error Percentage : 

This represents how many errors occurred during tests. Usually, for a web application, the following errors can occur.

1. HTTP errors : For a web application, all HTTP responses are considered errors except the accepted ones. For example, HTTP 200 is never considered an error, whereas HTTP 500 is. Tools like JMeter support configurable HTTP error checking, where you can mark selected status codes as not errors. Example : IP-whitelisted sites often show 404 to forbidden clients, which means the site is up but not accessible. For this kind of activity, we need to exclude 404 from errors while testing from a blocked IP.

2. Assertion error : Like JMeter, all tools support step verification. So, if the data/time verification of a step fails, it shows an error. This is an application-specific response assertion.

3. SLA/response timeout error : When we apply an SLA or response timeout to a request and the server fails to respond within that condition, this error occurs.

4. Server error/exception : An application server, such as a JSP or ASP server, may return an application error or exception, which is considered a response error.

5. Protocol error : Like HTTP, every other protocol that we may use in a script has its own error types. Tools always consider those as errors.

And there are many more, based on the application implementation.
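
The effect of configurable error checking is easy to see with a small sketch. The class name and the status codes below are made up by me; the point is only that the same responses give a different error percentage depending on which status codes we accept.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: error percentage with a configurable set of accepted status codes.
class ErrorRate {
    // Returns the percentage of responses whose status code is not in the accepted set.
    static double errorPercent(int[] statusCodes, Set<Integer> accepted) {
        if (statusCodes.length == 0) {
            return 0.0;
        }
        int errors = 0;
        for (int code : statusCodes) {
            if (!accepted.contains(code)) {
                errors++;
            }
        }
        return 100.0 * errors / statusCodes.length;
    }

    public static void main(String[] args) {
        int[] codes = {200, 200, 404, 500, 200, 503, 200, 200, 404, 200}; // made-up responses

        Set<Integer> onlyOk = new HashSet<Integer>(Arrays.asList(200));
        Set<Integer> okAnd404 = new HashSet<Integer>(Arrays.asList(200, 404));

        System.out.println("Error % (only 200 accepted)   = " + errorPercent(codes, onlyOk));   // 40.0
        System.out.println("Error % (200 and 404 accepted) = " + errorPercent(codes, okAnd404)); // 20.0
    }
}
```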

Where to use this? 

1. It is mostly used by the technical/Dev team to know the error rate.

2. The rate of change in errors is often used for debugging a certain part of a system.

3. The type of error is mostly used to identify the key bottleneck in a large system.

4. While validating an SLA, error % indicates the stability of the system. So, it is important wherever SLA verification is present.

How to use the graph? 

1. All or per-type error % rate over time : (X axis time, Y axis error %) : This represents what percentage of requests caused errors as time passes. You can also represent it as a plain error count (not a percentage).

2. All or per-type error % rate over user/thread increment or decrement : (X axis user/thread, Y axis error %) : This represents what percentage of requests caused errors as users/threads increase or decrease. You can also represent it as a plain error count (not a percentage).

Best Practice : Error % rate over time is often used to represent application consistency.

Note : Sometimes this error percentage is also represented as errors only (a count of errors).


D. Network Bandwidth:  

This is another vital performance indicator. Nowadays we use different network protocols, each with its own bandwidth limitations. In fact, different network & proxy configurations can create different network environments.
This measurement indicates how much bandwidth the application needs and how the application will perform on a predefined bandwidth.
For network emulation, most tools, like JMeter, support a custom bandwidth setting, which helps you create the real user scenario your application requires.
And most load generating tools provide a measurement of bandwidth consumption. Usually the parameter name is
Request bandwidth (KB/s) : Shows how much data is sent and received for a particular request/transaction.

Where to use this? 
1. Mostly used for viewing the required bandwidth measurements.

2. Very useful for testing applications in a cloud environment, for cost estimation.

3. When an application migrates from one technology to another, it is often used with a benchmark to show low resource consumption.

4. This is very important for applications whose major user base uses a variety of network environments, like mobile users, pager users, and small handheld device users. For these kinds of low-resource users, it is used as a key performance indicator.

5. When your application needs to be tested in different network configurations and compared between them, this measurement defines the different standards. For example, if you are comparing application responses between 115 kbps and 10 Mbps, this will show the actual bandwidth consumption and the real application behavior in response time.
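
To get a feel for the 115 kbps vs 10 Mbps comparison, here is a rough back-of-the-envelope sketch. The class name and the payload size are made up by me, and the calculation assumes 1 kbps = 1000 bits per second and ignores latency and protocol overhead, so real measurements will be slower.

```java
// Hypothetical helper: ideal transfer time for a payload on a given link speed.
class TransferTime {
    // Returns seconds needed to move payloadBytes over a link of bitsPerSecond.
    static double seconds(long payloadBytes, long bitsPerSecond) {
        if (bitsPerSecond <= 0) {
            throw new IllegalArgumentException("bitsPerSecond must be positive");
        }
        return payloadBytes * 8.0 / bitsPerSecond;
    }

    public static void main(String[] args) {
        long payloadBytes = 575_000; // made-up page weight, about 575 KB

        System.out.println("At 115 kbps: " + seconds(payloadBytes, 115_000) + " s");
        System.out.println("At 10 Mbps : " + seconds(payloadBytes, 10_000_000) + " s");
    }
}
```

For this payload the ideal transfer takes 40 seconds at 115 kbps and under half a second at 10 Mbps, which is why response time has to be read together with the bandwidth of the test network.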

How to use the graph? 

1. Overall/Specific Request Bandwidth over time: (X axis Time, Y axis Bandwidth KB/S)

2. Maximum Overall/Specific Request Bandwidth requirement.(It is a bar chart, each bar for request, bar height is bandwidth)

3. Maximum response time over bandwidth: (X axis Bandwidth, Y axis Response time): It will show how application behaves on different bandwidth.

Best Practice :

1. It is often ignored for intranet sites.

2. It is a must for public-facing websites.

3. It is used for mobile apps or firmware client-facing applications.

4. It is important if your application is used in low-bandwidth areas, like rural areas where bandwidth is very expensive as well as rare.

5. It is very important for cloud-hosted applications, to measure the bandwidth requirement & costs.


Please leave a comment if you have any questions about this. Thanks.. :)