Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Benchmark #4

Open
kadyb opened this issue Jul 10, 2023 · 13 comments
Open

Benchmark #4

kadyb opened this issue Jul 10, 2023 · 13 comments

Comments

@kadyb
Copy link
Contributor

kadyb commented Jul 10, 2023

Two years ago I created the benchmark testset comparing packages for processing vector data (sf, terra, geopandas, geos, s2), and now I'm naturally interested in how rsgeo looks like (and maybe @Robinlovelace too 😃). rsgeo is a low-level interface, so the number of operations is limited -- I only tested the calculation of Euclidean distance between points and geometry intersects (it's pity that buffering is not originally implemented). My conclusions:

  1. In your comparison in README, the calculation of Euclidean distance is more than 10 times faster than in sf. However, for a larger dataset, the difference is about 30% (or another reason, the results can be outdated). It seems to me that here the best solution is to transform the geometry to a data frame and use C++ functions (such as Rfast::Dist()).
  2. I see that rustgeo doesn't support prepared geometries unlike geos, so there are big differences here.
Benchmark
library("sf")
library("geos")
library("rsgeo")
library("Rfast") # for fast Euclidean distance
set.seed(1)

poly_sf = st_bbox(c(xmin = 0, xmax = 1, ymax = 0, ymin = 1), crs = st_crs(2180))
poly_sf = st_as_sfc(poly_sf)
n = 300000
pts_sf = st_sample(poly_sf, n)

pts_rs = as_rsgeom(pts_sf)
pts_geos = as_geos_geometry(pts_sf)
poly_rs = as_rsgeom(poly_sf)
poly_geos = as_geos_geometry(poly_sf)

### distance between points ----------------------------------------------------
system.time(st_distance(pts_sf[1:5000], which = "Euclidean"))
#>  user  system elapsed
#> 1.817   0.519   2.338
system.time(euclidean_distance_matrix(pts_rs[1:5000], pts_rs[1:5000]))
#>  user  system elapsed
#> 1.526   0.320   1.848
system.time({
  crds = st_coordinates(pts_sf[1:5000])
  Rfast::Dist(crds, method = "euclidean")
})
#>  user  system elapsed
#> 0.589   0.120   0.707

### geometry intersects --------------------------------------------------------
system.time(intersects_sparse(pts_rs, poly_rs))
#>  user  system elapsed
#> 1.415   0.072   1.486

system.time(geos_intersects_matrix(pts_geos, poly_geos))
#>  user  system elapsed
#> 0.105   0.000   0.105
@JosiahParry
Copy link
Owner

You're right! Thank you! I'm going to embark on a complete rewrite of rsgeo in the coming weeks. One thing I can do is make sure that all point distance and binary predicates first build an R* tree first.

Although that may be a bit unnecessary for when there are not enough geometries to justify it.

What do you think? Always build the tree or optionally or have separate functions for it?

@JosiahParry
Copy link
Owner

Note that while there aren't "prepared" geometries we can do the work manually on the Rust side by creating the tree and inserting into it.

@Robinlovelace
Copy link

suggestion to make the Rfast one even faster: replace

  crds = st_coordinates(pts_sf[1:5000])

with an sfheaders alternative...

@kadyb
Copy link
Contributor Author

kadyb commented Jul 11, 2023

What do you think? Always build the tree or optionally or have separate functions for it?

Maybe optionally? In {sf} there is the prepared argument. In {terra} it works automatically, but you can use a C++ reference to do it manually (like x@ptr$relate_between()).

BTW1: Will tree building help anything in calculating the distance matrix? I think it is brute calculating the distance between all points?

BTW2: I am not expert, but I understand that you are proposing to implement spatial index and not prepared geometry? I understand that these are two separate things? I rely on this post from Edzer: https://r-spatial.org/r/2017/06/22/spatial-index.html

@JosiahParry
Copy link
Owner

Re slowness of euclidean distance matrix I think this is a result of two things:

  1. excessive use of cloning
    .map(|y| euclidean_distance_impl(x.geom.clone(), &y.geom))
  2. excessive matching

There are two matches per iteration which is quite a lot. I think this can be helped when https://github.com/georust/geo/pull/1029/files/2709d7f841af05551e1d065403a853dbe5fdde70 is merged.

Re prepared geometries (it's suprising that there is like, 0 documentation or academic papers easily found and its all source code and probably github reviews), my understanding prepared geometries comes largely from geo-rust community and discussion georust/geo#803 is that you create a spatial index upfront (this is the preparation). Then the index can be used in subsequent calls etc to make narrowing down candidates far faster.

@kadyb
Copy link
Contributor Author

kadyb commented Jul 18, 2023

@Robinlovelace, little offtopic, but I tried {sfheaders} to extract coordinates from geometry and it is slower than {sf}. However, it doesn't really matter because this operation has little overhead. The second thing, maybe {sfheaders} shouldn't return two columns (i.e. sfg_id, point_id) -- they are rather unnecessary for points.

Benchmark
library("sf")
library("sfheaders")

poly_sf = st_bbox(c(xmin = 0, xmax = 1, ymax = 0, ymin = 1), crs = st_crs(2180))
poly_sf = st_as_sfc(poly_sf)
n = 500000
pts_sf = st_sample(poly_sf, n)

t = bench::mark(
  check = FALSE, iterations = 10,
  sf = st_coordinates(pts_sf),
  sfheaders = sfc_to_df(pts_sf)
)
t[, 1:5]
#>   expression      min   median `itr/sec` mem_alloc
#> 1 sf           8.69ms   11.2ms     45.1     15.3MB
#> 2 sfheaders  145.88ms  152.1ms      6.61      21MB

@Robinlovelace
Copy link

Very interesting! I've found sfheaders v. fast for some things but clearly not all. Thanks for all these benchmarks, really helpful if not essential for working out what to use when speed's an issue!

@JosiahParry
Copy link
Owner

JosiahParry commented Aug 25, 2023

This should be vastly improved in the new refactor branch. I just pushed a commit to parallelize the distance matrix.

It's not complete yet but the calculation portion is there. Just need to move from a list to a matrix.

a2ab18a#diff-ba040067b83f26349efd96faedb319914955849e8949c3a157330a1c0d895dae

image

@Robinlovelace
Copy link

The one to compare against is geos, faster than rsgeo for union in mini benchmark I put together: https://ogh23.robinlovelace.net/tidy#rsgeo

@JosiahParry
Copy link
Owner

It's definitely going to remain faster for unioning. I think there will be a minor speed up compared to your existing benchmark but geos is very good here 😛

@kadyb
Copy link
Contributor Author

kadyb commented Aug 25, 2023

BTW @Robinlovelace, isn't the world object CRS WGS84? Then you are comparing the s2 spherical engine in {sf} with planar in {geos} and {rsgeo}. From my observations, in general GEOS is faster than s2. In this case, you can also use the argument is_coverage = TRUE and it will be even faster.

@Robinlovelace
Copy link

BTW @Robinlovelace, isn't the world object CRS WGS84?

Correct, although as we're doing a union the results shouldn't differ too much, right (although I did notice several differences...)? Benchmark adaptations incoming!

@JosiahParry
Copy link
Owner

JosiahParry commented Aug 25, 2023

A note on the distance matrix calculations. Rfast calculates a distance matrix between a vector and itself. sf can do one or the other. When you calculate a distance matrix where y = x you only need to calculate one of the triangles and the diagonal is always 0. When the other vector is not identical you do not have symmetry and cannot shortcut the calculation. When we pass in a value of y for st_distance() the computation is so much slower. And to compare to Rfast we have to use dista().

As of right now rsgeo always computes the full dense matrix. I can and should add an enhancement to check if x is equal to y and short cut it.

Edit: adding bench mark for Rfast and new distance matrix. When comparing apples to apples they are on par with eachother.

library("sf")
#> Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
set.seed(1)

poly_sf = st_bbox(c(xmin = 0, xmax = 1, ymax = 0, ymin = 1), crs = st_crs(2180))
poly_sf = st_as_sfc(poly_sf)

n = 5000
pts_sf = st_sample(poly_sf, n)

system.time(st_distance(pts_sf))
#>    user  system elapsed 
#>   0.671   0.094   0.770
system.time(st_distance(pts_sf, rev(pts_sf)))
#>    user  system elapsed 
#>   6.197   0.064   6.278

library(rsgeo)

x <- rsgeo:::from_sfc(pts_sf)

system.time(distance_euclidean_matrix(x, rev(x)))
#>    user  system elapsed 
#>   0.348   0.168   0.321

crds <- coords(x)
c2 <- coords(rev(x))

system.time(Rfast::dista(crds, c2))
#>    user  system elapsed 
#>   0.228   0.034   0.266


bench::mark(
  Rfast = Rfast::dista(crds, c2),
  rsgeo = distance_euclidean_matrix(x, rev(x)),
  check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 Rfast         263ms    266ms      3.76     382MB     7.51
#> 2 rsgeo         215ms    252ms      3.97     191MB     1.99

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants