
catboost training process #832

Open
JeroenVerstraelen opened this issue Jul 24, 2024 · 7 comments · Fixed by Open-EO/openeo-python-driver#304 or #834

JeroenVerstraelen (Contributor) commented Jul 24, 2024

fit_class_catboost process

Initially only a VectorCube as input.
Load from geoparquet (load_url), potentially more than one input.
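
To make the intent concrete, below is a hedged sketch of a process graph that trains a CatBoost classifier from geoparquet inputs loaded with load_url. The process id fit_class_catboost is taken from this issue, but the argument names ("predictors", "target", loosely modeled on fit_class_random_forest) and the URLs are assumptions, not the final process definition.

```python
import openeo

# Rough sketch, not a final specification: "fit_class_catboost" comes from this
# issue, but the argument names below are assumptions modeled on
# fit_class_random_forest, and the geoparquet URLs are placeholders.
process_graph = {
    "features": {
        "process_id": "load_url",
        "arguments": {
            "url": "https://example.org/training_features.parquet",
            "format": "Parquet",
        },
    },
    "labels": {
        "process_id": "load_url",
        "arguments": {
            "url": "https://example.org/training_labels.parquet",
            "format": "Parquet",
        },
    },
    "fit": {
        "process_id": "fit_class_catboost",
        "arguments": {
            "predictors": {"from_node": "features"},
            "target": {"from_node": "labels"},
        },
    },
    "save": {
        "process_id": "save_ml_model",
        "arguments": {"data": {"from_node": "fit"}},
        "result": True,
    },
}

# Submit the graph as a batch job against a backend that supports the process.
connection = openeo.connect("openeo-dev.vito.be").authenticate_oidc()
job = connection.create_job(process_graph, title="catboost training")
job.start_and_wait()
```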

JeroenVerstraelen self-assigned this Jul 24, 2024
JeroenVerstraelen added a commit that referenced this issue Aug 5, 2024
These Python files are normally included in the jar, but there are several issues when we try to use that route in our backend, so we add a Spark interface to our source code instead.
JeroenVerstraelen added a commit that referenced this issue Aug 5, 2024
JeroenVerstraelen added a commit to Open-EO/openeo-python-driver that referenced this issue Aug 5, 2024
JeroenVerstraelen added a commit to Open-EO/openeo-python-driver that referenced this issue Aug 5, 2024
JeroenVerstraelen linked a pull request Aug 5, 2024 that will close this issue
JeroenVerstraelen (Contributor Author) commented:

Initial support for the catboost process is ready. One thing that needs to be double-checked is S3 support, because I am not sure whether CatboostModel.save() first collects the results at the driver and then saves them locally, or whether each executor saves its result directly to the provided path.
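
As a side note, whether the model lands as a single object or as per-executor parts could be verified after a test run. A minimal sketch of such a check, assuming boto3 access to the job bucket (bucket, prefix and endpoint below are placeholders) and a model written in native .cbm format:

```python
import boto3
from catboost import CatBoostClassifier

# List what actually ended up under the requested output path: one object, or
# one part per executor.
s3 = boto3.client("s3", endpoint_url="https://s3.example.org")
listing = s3.list_objects_v2(Bucket="openeo-batch-jobs", Prefix="some-job-id/catboost_model")
for obj in listing.get("Contents", []):
    print(obj["Key"], obj["Size"])

# Download the object and load it with the plain catboost package to confirm it
# is a valid, self-contained model (as was later done manually for this issue).
s3.download_file("openeo-batch-jobs", "some-job-id/catboost_model/model.cbm", "model.cbm")
model = CatBoostClassifier()
model.load_model("model.cbm")
print(model.predict([[0.1, 0.2, 0.3]]))  # dummy feature vector; width must match the training data
```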

JeroenVerstraelen (Contributor Author) commented Aug 22, 2024

When testing the catboost training code on CDSE staging, I noticed that a simple job keeps running for longer than 12 hours, even though the logs indicate that the catboost model was written to a file after a few minutes and that we reached the end of batch_job.py.

Job id: j-24082105e8314df2b9518097d731191b on cdse-staging

Driver pod is stuck in Running
Non-job-tracker logs run from 15:04 to 15:06, but the a-faa4c66f14414adfbf6da0f8e4871a25-driver pod remains in status Running. There are no executor pods.

A valid catboost model is written to S3
I copied the file locally and used it to make predictions.

Spark UI can't be accessed
http://a-faa4c66f14414adfbf6da0f8e4871a25.spark-jobs-staging.batch.stag.warsaw.openeo.dataspace.copernicus.eu/ returns a Bad Gateway error.

All batch jobs end in "Closing down clientserver connection" / "Error while sending or receiving"
I thought this error was specific to the catboost batch jobs, but it appears every job ends with this type of 'error'.

Catboost training works on Terrascope
I tested the same job on openeo-dev.vito.be and it works without issues. I also included catboost in the integration tests.

JeroenVerstraelen (Contributor Author) commented:

There is no Python process running on the driver. The only processes left are the spark-submit process and the JVM running the batch job.

The JVM has been stuck in a sleeping state since the Python process shut down:

ps -o state= -p 225207
S

The JVM is making no syscalls apart from a single futex_wait that it has been stuck on since the Python process shut down:

strace -p 225207 -e trace=all -v
strace: Process 225207 attached
futex(0x7f5fc114b9d0, FUTEX_WAIT, 37, NULL^Cstrace: Process 225207 detached

I'm not sure yet why this only happens for the catboost batch jobs, but so far it occurs 100% of the time.
Meanwhile, the driver logs clearly show the application shutting down and py4j closing the clientserver connection:
logs

JeroenVerstraelen (Contributor Author) commented Aug 25, 2024

Jstack output after 10 minutes
Jstack output after 1 hour

Prometheus
I see 5 prometheus threads in WAITING state that should probably be investigated further (prometheus-http-1-1, ..., prometheus-http-1-5).

S3a
There are 4 threads related to s3a:

  • Timer for 's3a-file-system' metrics system (TIMED_WAITING)
  • s3a-transfer-unbounded-pool4-t1 (WAITING)
  • s3a-transfer-unbounded-pool2-t1 (WAITING)
  • java-sdk-http-connection-reaper (TIMED_WAITING) (com.amazonaws.http.IdleConnectionReaper)

Possibly related s3a issue: S3A Deadlock in multipart copy due to thread pool limits.

DestroyJavaVM
The main thread has finished and the "DestroyJavaVM" thread is now running, waiting for all non-daemon threads to finish executing.

Scala-execution
I also see a single "scala-execution-context-global" thread in a WAITING state.

Jstack difference
There is barely any difference between the two jstack outputs. The only thread that changed is "sdk-ScheduledExecutor-0-0", which went from TIMED_WAITING to WAITING.

JeroenVerstraelen (Contributor Author) commented Aug 25, 2024

sudo lsof -p 1426683 | grep -v /usr

gdb info threads

syscalls for s3a threads (Gathered using strace -p 1426683 -f -e trace=all -v -o strace_all.txt)

JeroenVerstraelen (Contributor Author) commented Aug 26, 2024

All non-daemon threads:

[verstraj-local@cdse-staging-workers-315a2310-6r6gd ~]$ sudo jstack 1426683 | grep tid= | grep -v daemon
"pool-37-thread-1" #348 prio=5 os_prio=0 cpu=118.05ms elapsed=59371.64s tid=0x00007f421a0c7000 nid=0x1e4 waiting on condition  [0x00007f41b0016000]
"pool-37-thread-2" #349 prio=5 os_prio=0 cpu=28.71ms elapsed=59371.64s tid=0x00007f421a0c8800 nid=0x1e5 waiting on condition  [0x00007f41ac50c000]
"DestroyJavaVM" #380 prio=5 os_prio=0 cpu=2447.17ms elapsed=59360.74s tid=0x00007f42a801c000 nid=0x26 waiting on condition  [0x0000000000000000]
"VM Thread" os_prio=0 cpu=5838.60ms elapsed=59439.11s tid=0x00007f42a8176000 nid=0x2c runnable  
"GC Thread#0" os_prio=0 cpu=3576.33ms elapsed=59439.13s tid=0x00007f42a8034000 nid=0x27 runnable  
"GC Thread#1" os_prio=0 cpu=3299.10ms elapsed=59438.86s tid=0x00007f4270001000 nid=0x36 runnable  
"GC Thread#2" os_prio=0 cpu=3357.05ms elapsed=59431.33s tid=0x00007f4270035000 nid=0x4c runnable  
"GC Thread#3" os_prio=0 cpu=3268.16ms elapsed=59431.33s tid=0x00007f4270036000 nid=0x4d runnable  
"G1 Main Marker" os_prio=0 cpu=15.71ms elapsed=59439.13s tid=0x00007f42a807d800 nid=0x28 runnable  
"G1 Conc#0" os_prio=0 cpu=2103.51ms elapsed=59439.13s tid=0x00007f42a807f800 nid=0x29 runnable  
"G1 Refine#0" os_prio=0 cpu=7.61ms elapsed=59439.13s tid=0x00007f42a80d6800 nid=0x2a runnable  
"G1 Young RemSet Sampling" os_prio=0 cpu=15916.61ms elapsed=59439.13s tid=0x00007f42a80d8000 nid=0x2b runnable  
"VM Periodic Task Thread" os_prio=0 cpu=37801.49ms elapsed=59438.45s tid=0x00007f42a8cc6800 nid=0x3a waiting on condition 
"pool-37-thread-1" #348 prio=5 os_prio=0 cpu=118.05ms elapsed=59527.10s tid=0x00007f421a0c7000 nid=0x1e4 waiting on condition  [0x00007f41b0016000]
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@11.0.14/Native Method)
	- parking to wait for  <0x000000074571e488> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
	at java.util.concurrent.locks.LockSupport.park(java.base@11.0.14/LockSupport.java:194)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(java.base@11.0.14/AbstractQueuedSynchronizer.java:2081)
	at java.util.concurrent.LinkedBlockingQueue.take(java.base@11.0.14/LinkedBlockingQueue.java:433)
	at java.util.concurrent.ThreadPoolExecutor.getTask(java.base@11.0.14/ThreadPoolExecutor.java:1054)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.14/ThreadPoolExecutor.java:1114)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.14/ThreadPoolExecutor.java:628)
	at java.lang.Thread.run(java.base@11.0.14/Thread.java:829)
"pool-37-thread-2" #349 prio=5 os_prio=0 cpu=28.71ms elapsed=59563.03s tid=0x00007f421a0c8800 nid=0x1e5 waiting on condition  [0x00007f41ac50c000]
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@11.0.14/Native Method)
	- parking to wait for  <0x000000074571e488> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
	at java.util.concurrent.locks.LockSupport.park(java.base@11.0.14/LockSupport.java:194)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(java.base@11.0.14/AbstractQueuedSynchronizer.java:2081)
	at java.util.concurrent.LinkedBlockingQueue.take(java.base@11.0.14/LinkedBlockingQueue.java:433)
	at java.util.concurrent.ThreadPoolExecutor.getTask(java.base@11.0.14/ThreadPoolExecutor.java:1054)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.14/ThreadPoolExecutor.java:1114)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.14/ThreadPoolExecutor.java:628)
	at java.lang.Thread.run(java.base@11.0.14/Thread.java:829)

These worker pool threads are hanging on an empty LinkedBlockingQueue, possibly because they are waiting for the s3a threads to finish. There is only one syscall for both:
1427811 futex(0x7f421a0c958c, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>

jdries (Contributor) commented Sep 26, 2024

@JeroenVerstraelen I'd like to simply close this ticket. The focus for catboost should rather be on whether the current implementation satisfies the project's needs at all.
What about scheduling a simple system.exit(0) somewhere at the end of the batch job?
Another option is that Kubernetes kicks in and cleans up somehow, but then it needs to be able to detect 'hanging' batch jobs.
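
One possible reading of that suggestion, sketched below for the end of batch_job.py: stop the SparkContext and then force the driver JVM down through the py4j gateway. This is only a sketch of the workaround, not necessarily the fix that ended up in the repository; the function name and placement are assumptions.

```python
import logging

from pyspark import SparkContext

logger = logging.getLogger("batch_job")


def force_shutdown(exit_code: int = 0) -> None:
    """Force the driver JVM to exit once all job outputs have been written."""
    sc = SparkContext.getOrCreate()
    logger.info("Batch job finished, stopping SparkContext and forcing JVM exit")
    sc.stop()
    try:
        # System.exit still runs JVM shutdown hooks, but it does not wait for
        # non-daemon threads (such as the idle pool-37 workers seen in the
        # jstack output above), so the pod can no longer hang in Running.
        sc._gateway.jvm.java.lang.System.exit(exit_code)
    except Exception:
        # The py4j connection may drop while the JVM is going down; expected.
        logger.debug("py4j connection closed during forced JVM exit", exc_info=True)
```

Runtime.getRuntime().halt() would be the harsher variant that also skips shutdown hooks; the Kubernetes route mentioned above would instead require some liveness or timeout check to detect jobs whose driver pod never leaves the Running state.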
