catboost training process #832
These Python files are normally included in the jar, but we ran into several issues when trying to use that route in our backend, so we added a Spark interface to our own source code instead.
Initial support for the catboost process is ready. One thing that still needs to be double-checked is S3 support: I am not sure whether CatboostModel.save() first collects the results at the driver and then saves them locally, or whether each executor saves its result directly to the provided path.
When testing the catboost training code on CDSE staging, I noticed that a simple job keeps running for more than 12 hours, even though the logs indicate that the catboost model was written to a file after a few minutes and that we reached the end of batch_job.py.
Job id: j-24082105e8314df2b9518097d731191b on cdse-staging
Driver pod is stuck in running
Valid catboost model is written to s3
Spark UI can't be accessed
All batch jobs end in
Catboost training works on Terrascope
There is no Python process running on the driver. The only processes left are the spark-submit process and the JVM running the batch job. The JVM has been stuck in a Sleeping state since the Python process shut down:
The JVM is making zero syscalls except for a single futex_wait that it has been stuck on since the Python process shut down:
I'm not sure yet why this only happens for the catboost batch jobs, but so far it occurs 100% of the time.
Jstack output after 10 minutes Prometheus S3a
Possibly related s3a issue: S3A Deadlock in multipart copy due to thread pool limits.
DestroyJavaVM Scala-execution Jstack difference
sudo lsof -p 1426683 | grep -v /usr
syscalls for s3a threads (Gathered using
All non-daemon threads:
These worker pool threads are blocked on an empty LinkedBlockingQueue, possibly because they are waiting for the s3a threads to finish. There is only one syscall for both:
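Since a JVM only exits once all of its non-daemon threads have finished, listing those threads from a saved jstack dump is a quick way to see what keeps the driver alive. The following is a minimal sketch of such a filter, assuming the usual HotSpot thread-header layout (quoted name, `#<id>`, optional `daemon` keyword). The helper names and the `SAMPLE` dump are hypothetical and purely illustrative; the sample is synthetic and not taken from the job above.

```python
import re
from typing import List, NamedTuple, Optional


class JavaThread(NamedTuple):
    name: str
    daemon: bool
    state: str  # e.g. "WAITING (parking)"; empty if no state line was found


# Header of a Java thread: quoted name, "#<id>", optional "[<os tid>]"
# (printed by newer JDKs), then the optional "daemon" keyword.
_HEADER = re.compile(r'^"(?P<name>.*)" #\d+ (?:\[\d+\] )?(?P<daemon>daemon )?')
_STATE = re.compile(r'^\s+java\.lang\.Thread\.State: (?P<state>.+)$')


def parse_jstack(dump: str) -> List[JavaThread]:
    """Parse Java thread headers and their state lines from a jstack dump."""
    threads: List[JavaThread] = []
    current: Optional[int] = None  # index of the thread block being read
    for line in dump.splitlines():
        if not line.strip():
            current = None  # a blank line ends the current thread block
            continue
        header = _HEADER.match(line)
        if header:
            threads.append(
                JavaThread(header["name"], header["daemon"] is not None, "")
            )
            current = len(threads) - 1
            continue
        state = _STATE.match(line)
        if state and current is not None:
            threads[current] = threads[current]._replace(
                state=state["state"].strip()
            )
    return threads


def non_daemon_threads(dump: str) -> List[JavaThread]:
    """Return only the threads that can keep the JVM from exiting."""
    return [t for t in parse_jstack(dump) if not t.daemon]


# Synthetic sample in jstack's usual layout (NOT output from the real job).
SAMPLE = '''\
"main" #1 prio=5 os_prio=0 tid=0x0000000000000001 nid=0x1 waiting on condition [0x0000000000000000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)

"example-daemon" #2 daemon prio=5 os_prio=0 tid=0x0000000000000002 nid=0x2 waiting on condition [0x0000000000000000]
   java.lang.Thread.State: WAITING (parking)

"example-worker-1" #3 prio=5 os_prio=0 tid=0x0000000000000003 nid=0x3 waiting on condition [0x0000000000000000]
   java.lang.Thread.State: WAITING (parking)

"VM Thread" os_prio=0 tid=0x0000000000000004 nid=0x4 runnable
'''

if __name__ == "__main__":
    for thread in non_daemon_threads(SAMPLE):
        print(thread.name, "|", thread.state)
```

Internal VM threads have no `#<id>` in their header, so the sketch skips them. Running it on two dumps taken a few minutes apart and comparing the results would show which non-daemon threads never make progress.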
@JeroenVerstraelen I'd like to simply close this ticket. The focus for catboost should rather be on whether the current implementation satisfies the project needs at all.
fit_class_catboost process
Initially only VectorCube as input.
load from geoparquet (load_url), more than 1.