
catboost training process #832

Open
JeroenVerstraelen opened this issue Jul 24, 2024 · 7 comments · Fixed by Open-EO/openeo-python-driver#304 or #834

JeroenVerstraelen (Contributor) commented Jul 24, 2024

fit_class_catboost process

Initially only a VectorCube as input.
Load from geoparquet (load_url), potentially more than one input.
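
To make the intent concrete, below is a hedged sketch of a process graph that trains a CatBoost classifier from geoparquet inputs loaded with load_url. The process id fit_class_catboost is taken from this issue, but the argument names ("predictors", "target", loosely modeled on fit_class_random_forest) and the URLs are assumptions, not the final process definition.

```python
import openeo

# Rough sketch, not a final specification: "fit_class_catboost" comes from this
# issue, but the argument names below are assumptions modeled on
# fit_class_random_forest, and the geoparquet URLs are placeholders.
process_graph = {
    "features": {
        "process_id": "load_url",
        "arguments": {
            "url": "https://example.org/training_features.parquet",
            "format": "Parquet",
        },
    },
    "labels": {
        "process_id": "load_url",
        "arguments": {
            "url": "https://example.org/training_labels.parquet",
            "format": "Parquet",
        },
    },
    "fit": {
        "process_id": "fit_class_catboost",
        "arguments": {
            "predictors": {"from_node": "features"},
            "target": {"from_node": "labels"},
        },
    },
    "save": {
        "process_id": "save_ml_model",
        "arguments": {"data": {"from_node": "fit"}},
        "result": True,
    },
}

# Submit the graph as a batch job against a backend that supports the process.
connection = openeo.connect("openeo-dev.vito.be").authenticate_oidc()
job = connection.create_job(process_graph, title="catboost training")
job.start_and_wait()
```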

JeroenVerstraelen self-assigned this Jul 24, 2024
JeroenVerstraelen added a commit that referenced this issue Aug 5, 2024
These Python files are normally included in the jar, but there are several issues when we try to use that route in our backend, so we add a Spark interface to our source code instead.
JeroenVerstraelen added a commit that referenced this issue Aug 5, 2024
JeroenVerstraelen added a commit to Open-EO/openeo-python-driver that referenced this issue Aug 5, 2024
JeroenVerstraelen added a commit to Open-EO/openeo-python-driver that referenced this issue Aug 5, 2024
JeroenVerstraelen linked a pull request Aug 5, 2024 that will close this issue
JeroenVerstraelen (Contributor Author) commented:

Initial support for the catboost process is ready. One thing that needs to be double-checked is S3 support, because I am not sure whether CatboostModel.save() first collects the results at the driver and then saves them locally, or whether each executor saves its result directly to the provided path.
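
As a side note, whether the model lands as a single object or as per-executor parts could be verified after a test run. A minimal sketch of such a check, assuming boto3 access to the job bucket (bucket, prefix and endpoint below are placeholders) and a model written in native .cbm format:

```python
import boto3
from catboost import CatBoostClassifier

# List what actually ended up under the requested output path: one object, or
# one part per executor.
s3 = boto3.client("s3", endpoint_url="https://s3.example.org")
listing = s3.list_objects_v2(Bucket="openeo-batch-jobs", Prefix="some-job-id/catboost_model")
for obj in listing.get("Contents", []):
    print(obj["Key"], obj["Size"])

# Download the object and load it with the plain catboost package to confirm it
# is a valid, self-contained model (as was later done manually for this issue).
s3.download_file("openeo-batch-jobs", "some-job-id/catboost_model/model.cbm", "model.cbm")
model = CatBoostClassifier()
model.load_model("model.cbm")
print(model.predict([[0.1, 0.2, 0.3]]))  # dummy feature vector; width must match the training data
```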

JeroenVerstraelen (Contributor Author) commented Aug 22, 2024

When testing the catboost training code on CDSE staging, I noticed that a simple job keeps running for longer than 12 hours, even though the logs indicate that the catboost model was written to a file after a few minutes and that we reached the end of batch_job.py.

Job id: j-24082105e8314df2b9518097d731191b on cdse-staging

Driver pod is stuck in Running
Non-job-tracker logs run from 15:04 to 15:06, but the a-faa4c66f14414adfbf6da0f8e4871a25-driver pod remains in status Running. There are no executor pods.

A valid catboost model is written to S3
I copied the file locally and used it to make predictions.

Spark UI can't be accessed
http://a-faa4c66f14414adfbf6da0f8e4871a25.spark-jobs-staging.batch.stag.warsaw.openeo.dataspace.copernicus.eu/ returns a Bad Gateway error.

All batch jobs end in "Closing down clientserver connection" / "Error while sending or receiving"
I thought this error was specific to the catboost batch jobs, but it appears every job ends with this type of 'error'.

Catboost training works on Terrascope
I tested the same job on openeo-dev.vito.be and it works without issues. I also included catboost in the integration tests.

JeroenVerstraelen (Contributor Author) commented:

There is no Python process running on the driver. The only processes left are the spark-submit process and the JVM running the batch job.

The JVM has been stuck in a sleeping state since the Python process shut down:

ps -o state= -p 225207
S

The JVM is making no syscalls apart from a single futex_wait that it has been stuck on since the Python process shut down:

strace -p 225207 -e trace=all -v
strace: Process 225207 attached
futex(0x7f5fc114b9d0, FUTEX_WAIT, 37, NULL^Cstrace: Process 225207 detached

I'm not sure yet why this only happens for the catboost batch jobs, but so far it occurs 100% of the time.
Meanwhile, the driver logs clearly show the application shutting down and py4j closing the clientserver connection:
logs

JeroenVerstraelen (Contributor Author) commented Aug 25, 2024

Jstack output after 10 minutes
Jstack output after 1 hour

Prometheus
I see 5 prometheus threads in WAITING state that should probably be investigated further (prometheus-http-1-1, ..., prometheus-http-1-5).

S3a
There are 4 threads related to s3a:

  • Timer for 's3a-file-system' metrics system (TIMED_WAITING)
  • s3a-transfer-unbounded-pool4-t1 (WAITING)
  • s3a-transfer-unbounded-pool2-t1 (WAITING)
  • java-sdk-http-connection-reaper (TIMED_WAITING) (com.amazonaws.http.IdleConnectionReaper)

Possibly related s3a issue: S3A Deadlock in multipart copy due to thread pool limits.

DestroyJavaVM
The main thread has finished and the "DestroyJavaVM" thread is now running, waiting for all non-daemon threads to finish executing.

Scala-execution
I also see a single "scala-execution-context-global" thread in a WAITING state.

Jstack difference
There is barely any difference between the two jstack outputs. The only thread that changed is "sdk-ScheduledExecutor-0-0", which went from TIMED_WAITING to WAITING.

JeroenVerstraelen (Contributor Author) commented Aug 25, 2024

sudo lsof -p 1426683 | grep -v /usr

gdb info threads

syscalls for s3a threads (Gathered using strace -p 1426683 -f -e trace=all -v -o strace_all.txt)

JeroenVerstraelen (Contributor Author) commented Aug 26, 2024

All non-daemon threads:

[verstraj-local@cdse-staging-workers-315a2310-6r6gd ~]$ sudo jstack 1426683 | grep tid= | grep -v daemon
"pool-37-thread-1" #348 prio=5 os_prio=0 cpu=118.05ms elapsed=59371.64s tid=0x00007f421a0c7000 nid=0x1e4 waiting on condition  [0x00007f41b0016000]
"pool-37-thread-2" #349 prio=5 os_prio=0 cpu=28.71ms elapsed=59371.64s tid=0x00007f421a0c8800 nid=0x1e5 waiting on condition  [0x00007f41ac50c000]
"DestroyJavaVM" #380 prio=5 os_prio=0 cpu=2447.17ms elapsed=59360.74s tid=0x00007f42a801c000 nid=0x26 waiting on condition  [0x0000000000000000]
"VM Thread" os_prio=0 cpu=5838.60ms elapsed=59439.11s tid=0x00007f42a8176000 nid=0x2c runnable  
"GC Thread#0" os_prio=0 cpu=3576.33ms elapsed=59439.13s tid=0x00007f42a8034000 nid=0x27 runnable  
"GC Thread#1" os_prio=0 cpu=3299.10ms elapsed=59438.86s tid=0x00007f4270001000 nid=0x36 runnable  
"GC Thread#2" os_prio=0 cpu=3357.05ms elapsed=59431.33s tid=0x00007f4270035000 nid=0x4c runnable  
"GC Thread#3" os_prio=0 cpu=3268.16ms elapsed=59431.33s tid=0x00007f4270036000 nid=0x4d runnable  
"G1 Main Marker" os_prio=0 cpu=15.71ms elapsed=59439.13s tid=0x00007f42a807d800 nid=0x28 runnable  
"G1 Conc#0" os_prio=0 cpu=2103.51ms elapsed=59439.13s tid=0x00007f42a807f800 nid=0x29 runnable  
"G1 Refine#0" os_prio=0 cpu=7.61ms elapsed=59439.13s tid=0x00007f42a80d6800 nid=0x2a runnable  
"G1 Young RemSet Sampling" os_prio=0 cpu=15916.61ms elapsed=59439.13s tid=0x00007f42a80d8000 nid=0x2b runnable  
"VM Periodic Task Thread" os_prio=0 cpu=37801.49ms elapsed=59438.45s tid=0x00007f42a8cc6800 nid=0x3a waiting on condition 
"pool-37-thread-1" #348 prio=5 os_prio=0 cpu=118.05ms elapsed=59527.10s tid=0x00007f421a0c7000 nid=0x1e4 waiting on condition  [0x00007f41b0016000]
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@11.0.14/Native Method)
	- parking to wait for  <0x000000074571e488> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
	at java.util.concurrent.locks.LockSupport.park(java.base@11.0.14/LockSupport.java:194)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(java.base@11.0.14/AbstractQueuedSynchronizer.java:2081)
	at java.util.concurrent.LinkedBlockingQueue.take(java.base@11.0.14/LinkedBlockingQueue.java:433)
	at java.util.concurrent.ThreadPoolExecutor.getTask(java.base@11.0.14/ThreadPoolExecutor.java:1054)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.14/ThreadPoolExecutor.java:1114)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.14/ThreadPoolExecutor.java:628)
	at java.lang.Thread.run(java.base@11.0.14/Thread.java:829)
"pool-37-thread-2" #349 prio=5 os_prio=0 cpu=28.71ms elapsed=59563.03s tid=0x00007f421a0c8800 nid=0x1e5 waiting on condition  [0x00007f41ac50c000]
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@11.0.14/Native Method)
	- parking to wait for  <0x000000074571e488> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
	at java.util.concurrent.locks.LockSupport.park(java.base@11.0.14/LockSupport.java:194)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(java.base@11.0.14/AbstractQueuedSynchronizer.java:2081)
	at java.util.concurrent.LinkedBlockingQueue.take(java.base@11.0.14/LinkedBlockingQueue.java:433)
	at java.util.concurrent.ThreadPoolExecutor.getTask(java.base@11.0.14/ThreadPoolExecutor.java:1054)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.14/ThreadPoolExecutor.java:1114)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.14/ThreadPoolExecutor.java:628)
	at java.lang.Thread.run(java.base@11.0.14/Thread.java:829)

These worker pool threads are hanging on an empty LinkedBlockingQueue, possibly because they are waiting for the s3a threads to finish. There is only one syscall for both:
1427811 futex(0x7f421a0c958c, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>

jdries (Contributor) commented Sep 26, 2024

@JeroenVerstraelen I'd like to simply close this ticket. The focus for catboost should rather be on whether the current implementation satisfies the project's needs at all.
What about scheduling a simple system.exit(0) somewhere at the end of the batch job?
Another option is that Kubernetes kicks in and cleans up somehow, but then it needs to be able to detect 'hanging' batch jobs.
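
One possible reading of that suggestion, sketched below for the end of batch_job.py: stop the SparkContext and then force the driver JVM down through the py4j gateway. This is only a sketch of the workaround, not necessarily the fix that ended up in the repository; the function name and placement are assumptions.

```python
import logging

from pyspark import SparkContext

logger = logging.getLogger("batch_job")


def force_shutdown(exit_code: int = 0) -> None:
    """Force the driver JVM to exit once all job outputs have been written."""
    sc = SparkContext.getOrCreate()
    logger.info("Batch job finished, stopping SparkContext and forcing JVM exit")
    sc.stop()
    try:
        # System.exit still runs JVM shutdown hooks, but it does not wait for
        # non-daemon threads (such as the idle pool-37 workers seen in the
        # jstack output above), so the pod can no longer hang in Running.
        sc._gateway.jvm.java.lang.System.exit(exit_code)
    except Exception:
        # The py4j connection may drop while the JVM is going down; expected.
        logger.debug("py4j connection closed during forced JVM exit", exc_info=True)
```

Runtime.getRuntime().halt() would be the harsher variant that also skips shutdown hooks; the Kubernetes route mentioned above would instead require some liveness or timeout check to detect jobs whose driver pod never leaves the Running state.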
