This repository has been archived by the owner on Jun 25, 2023. It is now read-only.

v1.0b #25

Open · wants to merge 14 commits into main

Conversation

@vzip commented May 30, 2023

Hi everybody. This is the first example: it is based on Tornado for keeping requests matched to their responses, Redis with multiple DBs and pipelines, ioredis, and asyncio for the tasker, and that's all. It's pretty simple but works stably and fast. It is easy to extend the clusters of workers to expand processing power for ML tasks, because only one part of the whole app builds the queue. I ran tests on a t2.xlarge AWS EC2 instance with all processing on CPU: 5 ML worker instances used a stable 8 GB of RAM, and 10 ML workers (as 2 clusters) made the average response about 2x quicker. I plan to run tests on GPU in the coming days.
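For illustration, here is a minimal sketch of the queueing pattern described above, assuming a Tornado handler that parks each request until its result appears in Redis and a separate pool of ML workers consuming the task list (the /process path matches the curl example later in this thread; the Redis key names, timeout, and single-DB setup are my simplifications, not the actual code in this PR):

import asyncio
import json
import uuid

import tornado.web
import redis.asyncio as aioredis  # redis-py asyncio client; the PR may use a different client/DB layout

redis_client = aioredis.Redis(host="localhost", db=0)  # hypothetical connection settings

class ProcessHandler(tornado.web.RequestHandler):
    async def post(self):
        task_id = str(uuid.uuid4())
        # Park the request: enqueue the task for the ML workers...
        await redis_client.rpush(
            "tasks", json.dumps({"id": task_id, "text": self.request.body.decode()})
        )
        # ...and block (with a timeout) until a worker pushes the result for this task id.
        result = await redis_client.blpop(f"result:{task_id}", timeout=30)
        if result is None:
            raise tornado.web.HTTPError(504, "worker did not respond in time")
        self.set_header("Content-Type", "application/json")
        self.write(result[1])

def make_app() -> tornado.web.Application:
    return tornado.web.Application([("/process", ProcessHandler)])

async def main() -> None:
    make_app().listen(8000)
    await asyncio.Event().wait()

if __name__ == "__main__":
    asyncio.run(main())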

p.s. config.py needs to be updated to support settings for running more clusters together; that will be released soon.

Thank you All and have a good time!

Review threads (outdated, resolved): solution/dev_parallel/server.py, solution/Dockerfile
@vzip requested a review from rsolovev on May 31, 2023 02:30
@rsolovev (Collaborator) left a comment

@vzip, the latest solution launched with errors; here are the full logs:
inca-smc-mlops-challenge-solution-84c9b6cf74-z5jwt.log

@vzip (Author) commented May 31, 2023

@rsolovev, found the problem: a wrong directory path in supervisord.conf. Solved.

@vzip (Author) commented May 31, 2023

Amazon approved the g4dn.2xlarge instance for me; I will try to run it and optimise the number of workers in the solution.

@vzip (Author) commented Jun 1, 2023

Please run the test.

@rsolovev (Collaborator) left a comment

@vzip, here are the logs for the latest commit:
inca-smc-mlops-challenge-solution-758765f579-pwhwd.log

The pod is running without restarts, but every curl request (even from the pod's localhost) hangs with no response. There seem to be no problems with GPU/CUDA:

root@inca-smc-mlops-challenge-solution-758765f579-pwhwd:/solution# nvidia-smi 
Thu Jun  1 10:14:34 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.161.03   Driver Version: 470.161.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   44C    P0    39W /  70W |   7632MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+



root@inca-smc-mlops-challenge-solution-758765f579-pwhwd:/solution# python
Python 3.10.11 (main, Apr 20 2023, 19:02:41) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.cuda.current_device()
0
>>> torch.cuda.get_device_name(0)
'Tesla T4'
>>> 

However, I can't see any logs related to Redis (not even unsuccessful connection attempts), even though the Redis-related env is set:

root@inca-smc-mlops-challenge-solution-758765f579-pwhwd:/solution# echo "$REDIS_HOST $REDIS_PASSWORD"
inca-redis-master.default.svc.cluster.local <redacted>

@vzip (Author) commented Jun 1, 2023

Thank you for the run. I checked; the problem was that the server did not validate the incoming data. Solved.

@vzip (Author) commented Jun 1, 2023

@rsolovev I added validation of the input data in incoming requests. This should fix the previous issue.
Please start the test again.
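For reference, a minimal sketch of the kind of input validation described here, assuming the endpoint expects a JSON-encoded, non-empty string as in the curl example later in this thread (the exact checks and error responses in this PR may differ):

import json

import tornado.web

class ProcessHandler(tornado.web.RequestHandler):
    async def post(self):
        # Reject anything that is not a JSON-encoded, non-empty string before it reaches the queue.
        try:
            payload = json.loads(self.request.body)
        except (ValueError, UnicodeDecodeError):
            raise tornado.web.HTTPError(400, "body must be valid JSON")
        if not isinstance(payload, str) or not payload.strip():
            raise tornado.web.HTTPError(400, "body must be a non-empty JSON string")
        # ... enqueue `payload` for the workers as before ...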

p.s. Launched on g4dn.2xlarge; it looks like I can try to fit in one more cluster of workers.
Screenshot 2023-06-01 at 14:39:36

I ran some tests; in the end only 8 workers (models) fit in 16 GB of GPU memory, but that can probably give better results with a large number of tasks. The queue is arranged like 1,2,3,4,5 and 7,8,9,4,5 (where workers 4 and 5 handle tasks for both chains). I will try to pick the 2 fastest of these 5 models and put them on double duty.
Screenshot 2023-06-01 at 21:52:13
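One way to read the queue layout described above, as a sketch only (the chain/queue names and routing function are my assumptions, not the PR's code): two chains of models, where the shared workers 4 and 5 consume a single queue fed by both chains.

# Hypothetical routing for the two worker chains: chain A runs models 1-5,
# chain B runs models 7-9 and reuses workers 4 and 5.
SHARED_MODELS = {4, 5}

def queue_name(chain: str, model_id: int) -> str:
    """Return the Redis list a task for `model_id` should be pushed to."""
    if model_id in SHARED_MODELS:
        # Workers 4 and 5 consume one shared queue, serving both chains.
        return f"queue:shared:{model_id}"
    return f"queue:{chain}:{model_id}"

assert queue_name("b", 4) == "queue:shared:4"   # chain B offloads step 4 onto the shared worker
assert queue_name("a", 1) == "queue:a:1"        # chain A's own worker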

@rsolovev (Collaborator) left a comment

@vzip thank you! The API now responds with the intended output, but the key names are a bit off: answers are expected to be identified by the model's author rather than the model's name. Please check this section of the README. Without this format we won't be able to execute the tests properly. Can you please change the output? Thank you in advance.

p.s. The result I got for my curl was:

curl --request POST \
>   --url http://localhost:8000/process \
>   --header 'Content-Type: application/json' \
>   --data '"I live in London"'

{"twitter-xlm-roberta-base-sentiment": {"score": 0.759354829788208, "label": "NEUTRAL"}, "language-detection-fine-tuned-on-xlm-roberta-base": {"score": 0.9999200105667114, "label": "English"}, "twitter-xlm-roberta-crypto-spam": {"score": 0.8439149856567383, "label": "SPAM"}, "xlm_roberta_base_multilingual_toxicity_classifier_plus": {"score": 0.9999451637268066, "label": "LABEL_0"}, "Fake-News-Bert-Detect": {"score": 0.95546954870224, "label": "LABEL_0"}}

@vzip (Author) commented Jun 2, 2023

@rsolovev The output has been changed; it is now keyed by the model's author.
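For reference, a sketch of the kind of re-keying this implies, assuming the aggregator previously keyed results by model name as in the curl output above; the author values must come from the challenge README / Hugging Face model ids, so the single entry shown is only an example:

# Hypothetical mapping from model name (old response keys) to model author
# (the key format expected by the challenge README).
MODEL_AUTHOR = {
    "twitter-xlm-roberta-base-sentiment": "cardiffnlp",
    # ... one entry per model, taken from the README ...
}

def rekey_by_author(results: dict) -> dict:
    """Re-key {model_name: prediction} as {model_author: prediction}."""
    return {MODEL_AUTHOR[name]: pred for name, pred in results.items()}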

@vzip requested a review from rsolovev on June 2, 2023 19:00
@rsolovev (Collaborator) left a comment

Thank you @vzip, now everything is perfect! Here are our test results for the latest commit. If you want, you can improve or optimise your solution and re-request review. Every contribution will count when choosing the challenge winner.

@vzip (Author) commented Jun 2, 2023

@rsolovev Thank you so much for running the test. Yes, I will update it now: I have prepared a config to fill all the video memory and will compare the results.

@vzip (Author) commented Jun 2, 2023

@rsolovev Please run another test; we will see what adding more workers can improve.

p.s. Please allow 2 minutes after starting the Docker instance so that all workers are fully loaded into GPU memory before starting the test.
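If the fixed warm-up delay ever becomes inconvenient, one alternative (a sketch only, not what this PR does; the readiness set and worker count are hypothetical) is to have each worker mark itself ready in Redis once its model is loaded, and wait on that before testing:

import time

import redis  # synchronous client is enough for a one-off startup check

EXPECTED_WORKERS = 8          # hypothetical worker count for this deployment
READY_SET = "workers:ready"   # hypothetical: each worker SADDs its id here once its model is on the GPU

def wait_for_workers(host: str = "localhost", timeout: float = 180.0) -> None:
    r = redis.Redis(host=host)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if r.scard(READY_SET) >= EXPECTED_WORKERS:
            return
        time.sleep(2)
    raise TimeoutError("not all workers finished loading their models")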

@vzip requested a review from rsolovev on June 2, 2023 19:51
@vzip (Author) commented Jun 3, 2023

Question: I noticed that some participants use model optimization approaches, but the task states that "Model's performance optimization is not allowed." From the architecture side, the bottleneck is the models and the memory they occupy. If exceptions are allowed here, please confirm: I could then fit the second group of workers into memory completely, or reduce the time it needs to compute its answers. I see the possibility of improving results at the maximum volume of incoming requests by 2x thanks to such optimization. Of course, I still have a plan in reserve to completely split the queues of the worker groups, but it is a shame when 2 workers from the first group have to help the second group only because 2 gigabytes of GPU memory were not enough to fit the entire group.

@darknessest (Collaborator) commented

> Question: I noticed that some participants use model optimization approaches, but in the task it is noted that "Model's performance optimization is not allowed."

Hey there, @vzip, we had an extensive internal debate regarding this, and a compromise we agreed on is:
"As long as the model optimization is done in the runtime, we won't disregard that solution right away"

But please bear in mind that there won't be any "bonus points" just for a very performant model-optimized solution, as we have various determining factors when choosing the best solution.

@vzip (Author) commented Jun 3, 2023

I assume you are off for the weekend, but I could not wait to see the result, so I deployed your k6 config and tested with 3 more workers: the result is +62% throughput; on the network between Docker containers on one host it gives 9303 parrots. Now I can start trying to optimise how the models themselves work ;)
Screenshot 2023-06-03 at 16:15:24

@rsolovev (Collaborator) left a comment

Hey @vzip, thank you for these additions to the submission. Here are the test results for the latest commit on our infra.

@vzip (Author) commented Jun 5, 2023

@rsolovev Thank you for running the test. It is strange that I see different numbers on my infra, but that's OK. I swapped some workers; please run the test again, and I will keep trying to find a more efficient solution.

p.s. I am thinking that Redis not being on the same host may change the timings, so I will cut Redis out and check. But in my solution Redis provides something important: if anything crashes over a long run, the tasks are safe and will surely be completed. Next I will test ONNX, because it changes the speed at which the models solve tasks.
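A sketch of the crash-safety Redis can provide here, assuming the workers use the classic reliable-queue pattern (atomically move a task into a per-worker processing list and remove it only after the result is written); the key names are illustrative, not taken from this PR:

import redis

r = redis.Redis(host="localhost")

def take_task(worker_id: str, timeout: int = 5):
    # Atomically move one task from the pending list into this worker's
    # processing list; if the worker crashes, the task is still in Redis
    # and can be requeued by a janitor process.
    return r.blmove("tasks:pending", f"tasks:processing:{worker_id}",
                    timeout, src="LEFT", dest="RIGHT")

def finish_task(worker_id: str, task: bytes, result: bytes) -> None:
    pipe = r.pipeline()
    pipe.rpush("results", result)
    pipe.lrem(f"tasks:processing:{worker_id}", 1, task)  # only now does the task disappear
    pipe.execute()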

@vzip requested a review from rsolovev on June 5, 2023 14:04
@rsolovev (Collaborator) left a comment

Thank you @vzip, here are the results for the latest commit.
