Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GRPC file handles not closing #86

Open
ross-spencer opened this issue Dec 20, 2023 · 2 comments · May be fixed by #89
Open

GRPC file handles not closing #86

ross-spencer opened this issue Dec 20, 2023 · 2 comments · May be fixed by #89

Comments

@ross-spencer
Copy link
Collaborator

We've been running COOP as a long-running service for a decent while now. Every so often, we're seeing services stop with too many open file handles: Network.Socket.accept: resource exhausted (Too many open files):

Are there any known improvements in the underlying GRPC libraries that we can leverage here to prevent this? Or is there something else we can see in the code that might have been missed preventing file handles from being closed?

@bladyjoker
Copy link
Collaborator

Some relevant context yesodweb/wai#825

@ross-spencer
Copy link
Collaborator Author

Asked to recreate this I simply setup a bash script to submit transactions and keep track of file handles:

#! /usr/bin/bash

count=1000
for i in $(seq $count); do

echo "============"
echo "iteration $i"
echo "============"


ID=$(uuid -v4)

echo $ID

echo $(json-fs-store-cli insert-fact-statement --db $JS_STORE_DIR/json-store.db --fact_statement_id "$ID" --json '{"@context": "https://schema.org", "type": "PropertyValue", "name": "TEST1/TEST2", "value": 0.3142, "valueReference": [{"@type": "PropertyValue", "name": "validFrom", "value": 1705404151}, {"@type": "PropertyValue", "name": "validThrough", "value": 1705404451}], "ArklyID": "4343235324235235432453FAC710", "SecurityToken": "32R43243223432432"} ')

REQ=$(cat <<EOF
    {
        "fsInfos": [
            {
                "fsId": "$(echo -ne $ID | base64)",
                "gcAfter": {
                    "extended": "NEG_INF"
                }
            }
        ],
        "submitter": {
            "base16": "$SUBMITTER_PKH"
        }
    }
EOF
)


COOP_PROTO=coop-proto

RESP=$(echo $REQ | grpcurl -insecure -import-path $COOP_PROTO \
 -proto $COOP_PROTO/publisher-service.proto -d @ \
 localhost:5080 coop.publisher.Publisher/createMintFsTx
)

echo "fs handles ($i): " $(lsof | grep "json-fs" | wc -l && date)
echo "publisher handles ($i)" $(lsof | grep "coop-publisher" | wc -l && date)
echo "coop-pab handles ($i)" $(lsof | grep "coop-pab" | wc -l && date)

echo $(sleep 120)

echo "done"

done

We basically see the publisher grpc or the local fact store fail in different ways:

============
iteration 511
============
3a69e849-4d49-4072-ae96-4610c046ac31
INSERT INTO "fact_statements"("fact_statement_id", "json") VALUES (?, ?); -- With values: [SQLBlob "3a69e849-4d49-4072-ae96-4610c046ac31",SQLBlob "{\"@context\": \"https://schema.org\", \"type\": \"PropertyValue\", \"name\": \"TEST1/TEST2\", \"value\": 0.3142, \"valueReference\": [{\"@type\": \"PropertyValue\", \"name\": \"validFrom\", \"value\": 1705404151}, {\"@type\": \"PropertyValue\", \"name\": \"validThrough\", \"value\": 1705404451}], \"ArklyID\": \"4343235324235235432453FAC710\", \"SecurityToken\": \"32R43243223432432\"} "]
Failed to dial target host "localhost:5080": read tcp [::1]:33474->[::1]:5080: read: connection reset by peer
fs handles (iter. 511):  543 Thu Jan 18 05:31:14 UTC 2024
publisher handles (iter. 511) 0 Thu Jan 18 05:31:14 UTC 2024
coop-pab handles (iter. 511) 34 Thu Jan 18 05:31:14 UTC 2024

^^^
Network.Socket.accept: resource exhausted (Too many open files)                                                                                                        
[coop-env ~ coop.0] $ coop-publisher-cli publisher-grpc      

Second failure:

============
iteration 136
============
d85ec7ff-a3eb-4d96-bbc3-d8b1013fb9c9
INSERT INTO "fact_statements"("fact_statement_id", "json") VALUES (?, ?); -- With values: [SQLBlob "d85ec7ff-a3eb-4d96-bbc3-d8b1013fb9c9",SQLBlob "{\"@context\": \"https://schema.org\", \"type\": \"PropertyValue\", \"name\": \"TEST1/TEST2\", \"value\": 0.3142, \"valueReference\": [{\"@type\": \"PropertyValue\", \"name\": \"validFrom\", \"value\": 1705404151}, {\"@type\": \"PropertyValue\", \"name\": \"validThrough\", \"value\": 1705404451}], \"ArklyID\": \"4343235324235235432453FAC710\", \"SecurityToken\": \"32R43243223432432\"} "]
ERROR:
  Code: Internal
  Message: Network.Socket.connect: <socket: 1023>: does not exist (Connection refused)
fs handles (iter. 136):  0 Fri Jan 19 14:02:33 UTC 2024
publisher handles (iter. 136) 1 Fri Jan 19 14:02:33 UTC 2024
coop-pab handles (iter. 136) 34 Fri Jan 19 14:02:33 UTC 2024

done

^^^
SELECT "t0"."fact_statement_id" AS "res0", "t0"."json" AS "res1" FROM "fact_statements" AS "t0" WHERE ("t0"."fact_statement_id")=(?);
-- With values: [SQLBlob "3a4c52d6-aed1-4f17-82c9-4e20238fd4cc"]
Network.Socket.accept: resource exhausted (Too many open files)

After setting ulimit -n 65535 we see the failures again but slightly different messages:

============                                                                                                                                                           
iteration 508                                                                                                                                                          
============                                                                                                                                                           
ccfdc2f7-6e5b-433a-a9df-53d1733a2298                                                                                                                                   
INSERT INTO "fact_statements"("fact_statement_id", "json") VALUES (?, ?); -- With values: [SQLBlob "ccfdc2f7-6e5b-433a-a9df-53d1733a2298",SQLBlob "{\"@context\": \"htt
ps://schema.org\", \"type\": \"PropertyValue\", \"name\": \"TEST1/TEST2\", \"value\": 0.3142, \"valueReference\": [{\"@type\": \"PropertyValue\", \"name\": \"validFrom
\", \"value\": 1705404151}, {\"@type\": \"PropertyValue\", \"name\": \"validThrough\", \"value\": 1705404451}], \"ArklyID\": \"4343235324235235432453FAC710\", \"Securi
tyToken\": \"32R43243223432432\"} "]                                                                                                                                   
fs handles (iter. 508):  1052 Tue Jan 23 10:09:10 UTC 2024                                                                                                                   
publisher handles (iter. 508) 1 Tue Jan 23 10:09:11 UTC 2024                                                                                                                 
coop-pab handles (iter. 508) 35 Tue Jan 23 10:09:11 UTC 2024                                                                                                                 
                                                                                                                                                                       
done                                                                                                                                                                   
============                                                                                                                                                           
iteration 509                                                                                                                                                          
============                                                                                                                                                           
53543f1b-4b91-46cf-bbb5-8db68fe954fc                                                                                                                                   
INSERT INTO "fact_statements"("fact_statement_id", "json") VALUES (?, ?); -- With values: [SQLBlob "53543f1b-4b91-46cf-bbb5-8db68fe954fc",SQLBlob "{\"@context\": \"htt
ps://schema.org\", \"type\": \"PropertyValue\", \"name\": \"TEST1/TEST2\", \"value\": 0.3142, \"valueReference\": [{\"@type\": \"PropertyValue\", \"name\": \"validFrom
\", \"value\": 1705404151}, {\"@type\": \"PropertyValue\", \"name\": \"validThrough\", \"value\": 1705404451}], \"ArklyID\": \"4343235324235235432453FAC710\", \"Securi
tyToken\": \"32R43243223432432\"} "]                                                                                                                                   
fs handles (iter. 509):  1053 Tue Jan 23 10:11:21 UTC 2024                                                                                                                   
publisher handles (iter. 509) 1 Tue Jan 23 10:11:22 UTC 2024                                                                                                                 
coop-pab handles (iter. 509) 35 Tue Jan 23 10:11:22 UTC 2024      

============
iteration 510
============
41c14ca9-39ae-4c54-8708-bb8ef0aebce4
INSERT INTO "fact_statements"("fact_statement_id", "json") VALUES (?, ?); -- With values: [SQLBlob "41c14ca9-39ae-4c54-8708-bb8ef0aebce4",SQLBlob "{\"@context\": \"https://schema.org\", \"type\": \"PropertyValue\", \"name\": \"TEST1/TEST2\", \"value\": 0.3142, \"valueReference\": [{\"@type\": \"PropertyValue\", \"name\": \"validFrom\", \"value\": 1705404151}, {\"@type\": \"PropertyValue\", \"name\": \"validThrough\", \"value\": 1705404451}], \"ArklyID\": \"4343235324235235432453FAC710\", \"SecurityToken\": \"32R43243223432432\"} "]
ERROR:
  Code: Unavailable
  Message: error reading from server: EOF

^^^^
"Sending CreateMintFsTxReq to TxBuilder: {fact_statements { fs_id: \"41c14ca9-39ae-4c54-8708-bb8ef0aebce4\" fs { pdmap { kvs { key { pdbytes: \"@context\" } value { pdbytes: \"https://schema.org\" } } kvs { key { pdbytes: \"ArklyID\" } value { pdbytes: \"4343235324235235432453FAC710\" } } kvs { key { pdbytes: \"SecurityToken\" } value { pdbytes: \"32R43243223432432\" } } kvs { key { pdbytes: \"name\" } value { pdbytes: \"TEST1/TEST2\" } } kvs { key { pdbytes: \"type\" } value { pdbytes: \"PropertyValue\" } } kvs { key { pdbytes: \"value\" } value { pdconstr { index: 3 fields { pdint: 3142 } fields { pdint: 18446744073709551612 } } } } kvs { key { pdbytes: \"valueReference\" } value { pdlist { elements { pdmap { kvs { key { pdbytes: \"@type\" } value { pdbytes: \"PropertyValue\" } } kvs { key { pdbytes: \"name\" } value { pdbytes: \"validFrom\" } } kvs { key { pdbytes: \"value\" } value { pdint: 1705404151 } } } } elements { pdmap { kvs { key { pdbytes: \"@type\" } value { pdbytes: \"PropertyValue\" } } kvs { key { pdbytes: \"name\" } value { pdbytes: \"validThrough\" } } kvs { key { pdbytes: \"value\" } value { pdint: 1705404451 } } } } } } } } } gc_after { } } submitter { base16: \"bedfff1b6762f7a95437ab8ca30b3244dfe44029125d23656062a8b6\" }}"
coop-publisher-cli: file descriptor 1024 out of range for select (0--1024).
Recompile with -threaded to work around this.
============
iteration 1
============
c9bacac0-80a6-4422-a48c-7d5e786f95ce
INSERT INTO "fact_statements"("fact_statement_id", "json") VALUES (?, ?); -- With values: [SQLBlob "c9bacac0-80a6-4422-a48c-7d5e786f95ce",SQLBlob "$TEST_STMT"]
fs handles (iter. 1):  1058 Wed Jan 24 09:24:24 UTC 2024
publisher handles (iter. 1) 1 Wed Jan 24 09:24:24 UTC 2024
coop-pab handles (iter. 1) 34 Wed Jan 24 09:24:24 UTC 2024
^C
(venv) [coop-env ~ coop.0] $ nano XTEST.sh 
(venv) [coop-env ~ coop.0] $ ./XTEST.sh 
============
iteration 1
============
467c3877-eac8-4edd-af6f-5be40211fee7
INSERT INTO "fact_statements"("fact_statement_id", "json") VALUES (?, ?); -- With values: [SQLBlob "467c3877-eac8-4edd-af6f-5be40211fee7",SQLBlob "$TEST_STMT"]
ERROR:
  Code: Internal
  Message: HandshakeFailed (Error_Misc "Network.Socket.recvBuf: resource vanished (Connection reset by peer)")

^^^
SELECT "t0"."fact_statement_id" AS "res0", "t0"."json" AS "res1" FROM "fact_statements" AS "t0" WHERE ("t0"."fact_statement_id")=(?);
-- With values: [SQLBlob "c9bacac0-80a6-4422-a48c-7d5e786f95ce"]
json-fs-store-cli: file descriptor 1029 out of range for select (0--1024).
Recompile with -threaded to work around this.

We're seeing the json-fs-store-cli handles going up consistently (perhaps there's something being missed when closing the sqlite connection?) -- as for the other file handles, there doesn't seem to be a correlation in what I can gather with these techniques; other than it doesn't take much more than 1000 iterations to see the two different failures.

Let me know if there's anything else I can test/look for.

@chfanghr chfanghr linked a pull request Mar 20, 2024 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants