Understanding ArcadeDB local graph traversal performance #1110

SevorisDoe · 2023-06-04T12:55:26Z

I've been looking at graph databases for a while, and ArcadeDB has appeared on my radar repeatedly. To place it in the landscape of graph-capable databases, I am trying to better understand its indexing and access strategies, and I have a couple of questions.

I have seen it mentioned repeatedly that, assuming one knows the record ID, lookup of an edge or vertex can be performed in constant time. How does that work, and how is it achieved under the hood?
Looking at the source code for MutableVertex, there seems to be an implication that any vertex loaded from the database has a per-vertex list of all its incoming and outgoing edges.

To me that implies that you can perform extremely fast graph traversals in ArcadeDB, since there is never a need for large linear scans or O(log n) index scans of global edge lists. Once you have found your starting vertex, you perform a local scan of its presumably sparse edge list, where you can recover all the information of the edges pointed to in constant time, and thus also find the vertex at the other end in constant time, perform selections/updates on it, and so on. So any graph traversal or edge-and-connected-vertex-property selection that only touches a small section of the larger graph should run very quickly compared to a query that runs on most of the graph?

Answered by lvca

lvca · 2023-06-04T16:04:43Z

Maintainer

That's 100% accurate :-)

Retrieving any record by its ID (RID -> RecordID) takes O(1) time because the RID encodes the position of the record in the file. For example, the RID #12:1000000 means file number 12 (files are mapped by number in schema.json) and position in file = 1,000,000. This position is not the actual byte where the record starts. By default, buckets are created with a 64KB page size and a maximum of 2048 records per page.

So in order to find the physical location where the record is stored, you can apply this simple math: 1,000,000 / 2,048 = 488.28125. That means page 488, and location on page = 576 (1,000,000 mod 2048).

At this point you can just start reading the record at offset 488 (page) * 65536 (64KB page size). At the head of every page, there is a header that contains information about the record (how big it is, whether it's stored on multiple pages, etc.).

You can see this in the code here:

int pageId = (int) (rid.getPosition() / maxRecordsInPage);
int positionInPage = (int) (rid.getPosition() % maxRecordsInPage);
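
To make the arithmetic above concrete, here is a minimal standalone sketch of the same calculation. The class and method names (`RidLocator`, `pageStartOffset`, and so on) are hypothetical and are not ArcadeDB API. It assumes the default values mentioned above and ignores any file-level header, and it only locates the start of the page. Finding the record inside the page still requires reading the page header, which is not modeled here.

```java
// Sketch only: RidLocator and its methods are hypothetical, not ArcadeDB API.
final class RidLocator {
    // Defaults quoted in the answer above; real buckets may be configured differently.
    static final int PAGE_SIZE = 64 * 1024;       // 65,536 bytes
    static final int MAX_RECORDS_IN_PAGE = 2048;

    // Which page holds the record at this position.
    static int pageId(long position, int maxRecordsInPage) {
        return (int) (position / maxRecordsInPage);
    }

    // Which slot within that page belongs to the record.
    static int positionInPage(long position, int maxRecordsInPage) {
        return (int) (position % maxRecordsInPage);
    }

    // Byte offset where the page begins, assuming pages are laid out back to back.
    static long pageStartOffset(int pageId, int pageSize) {
        return (long) pageId * pageSize;
    }

    public static void main(String[] args) {
        long position = 1_000_000L; // the position part of RID #12:1000000
        int page = pageId(position, MAX_RECORDS_IN_PAGE);
        int slot = positionInPage(position, MAX_RECORDS_IN_PAGE);
        long offset = pageStartOffset(page, PAGE_SIZE);
        // prints: page=488 slot=576 pageStart=31981568
        System.out.println("page=" + page + " slot=" + slot + " pageStart=" + offset);
    }
}
```

None of these steps depends on how many records the bucket contains, which is why the lookup is constant time.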

3 replies

lvca Jun 4, 2023
Maintainer

Also, with ArcadeDB we store the pointers to the edges (always as RIDs) in a linked list ordered as LIFO (useful to retrieve the latest inserted edges by default without the need to retrieve and order them, if the use case can take advantage of it).

This linked list not only stores the RID of the edge but also the RID of the vertex. This allows you to jump between vertices without loading the edge. This is a very typical use case when edges have no information but are used only to connect vertices.
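
As a rough in-memory illustration of that idea, the sketch below keeps (edge RID, vertex RID) pairs in LIFO order and walks to neighbor vertices without touching the edges. All types here are hypothetical and not ArcadeDB classes, and a `java.util.ArrayDeque` stands in for the on-disk linked list.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Illustration only: these types are hypothetical and are not ArcadeDB classes.
final class EdgeListSketch {
    // A record ID: bucket (file) number plus position within the bucket.
    static final class Rid {
        final int bucketId;
        final long position;

        Rid(int bucketId, long position) {
            this.bucketId = bucketId;
            this.position = position;
        }

        @Override
        public String toString() {
            return "#" + bucketId + ":" + position;
        }
    }

    // One list entry: the edge's RID together with the RID of the vertex at the other end.
    static final class Entry {
        final Rid edge;
        final Rid connectedVertex;

        Entry(Rid edge, Rid connectedVertex) {
            this.edge = edge;
            this.connectedVertex = connectedVertex;
        }
    }

    // Newest entries sit at the head, giving LIFO iteration order.
    private final Deque<Entry> entries = new ArrayDeque<>();

    void addEdge(Rid edge, Rid connectedVertex) {
        entries.addFirst(new Entry(edge, connectedVertex));
    }

    // Collects neighbor vertex RIDs straight from the list; no edge record is loaded.
    List<Rid> neighbors() {
        List<Rid> result = new ArrayList<>();
        for (Entry e : entries) {
            result.add(e.connectedVertex);
        }
        return result;
    }

    public static void main(String[] args) {
        EdgeListSketch vertex = new EdgeListSketch();
        vertex.addEdge(new Rid(20, 0), new Rid(12, 5));
        vertex.addEdge(new Rid(20, 1), new Rid(12, 9));
        System.out.println(vertex.neighbors()); // prints [#12:9, #12:5], newest first
    }
}
```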

Last but not least, RIDs are stored compressed, so the RID #1:10 takes only 2 bytes (1 byte for 1 and 1 byte for 10). We use 7 of the 8 bits available in a byte for the number itself, and the remaining bit only tells whether the number continues on the next byte.
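
Here is a minimal sketch of a variable-length encoding of that kind: 7 payload bits per byte, with the high bit as the continuation flag. The class `VarIntSketch` is hypothetical. The byte order (lowest 7 bits first) and the restriction to non-negative values are assumptions, so the actual ArcadeDB serializer may differ in those details.

```java
import java.io.ByteArrayOutputStream;

// Sketch only: VarIntSketch is hypothetical and may not match ArcadeDB's actual byte layout.
final class VarIntSketch {
    // Writes a non-negative number using 7 payload bits per byte.
    // The high bit is set when more bytes follow (assumed order: lowest 7 bits first).
    static void writeUnsigned(ByteArrayOutputStream out, long value) {
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
    }

    // Encodes a RID as two variable-length numbers: bucket id, then position.
    static byte[] encodeRid(int bucketId, long position) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeUnsigned(out, bucketId);
        writeUnsigned(out, position);
        return out.toByteArray();
    }

    // Decodes the two numbers back; returns {bucketId, position}.
    static long[] decodeRid(byte[] bytes) {
        long[] parts = new long[2];
        int i = 0;
        for (int p = 0; p < 2; p++) {
            long value = 0;
            int shift = 0;
            int b;
            do {
                b = bytes[i++] & 0xFF;
                value |= (long) (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            parts[p] = value;
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(encodeRid(1, 10L).length);          // prints 2
        System.out.println(encodeRid(12, 1_000_000L).length);  // prints 4
    }
}
```

Under these assumptions #1:10 fits in 2 bytes, as described above, and #12:1000000 takes 4 bytes (1 for the bucket, 3 for the position).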

lvca Jun 4, 2023
Maintainer

With ArcadeDB we were able to implement many ideas we had during the many years of building OrientDB but couldn't actually realize there, for many reasons. For users coming from OrientDB, it's easy to see 10X performance on the same hardware with just a migration to ArcadeDB.

SevorisDoe Jun 4, 2023
Author

Thank you for the in-depth explanation!

And yeah, you realized some very cool ideas with this architecture.
