Finding elements/nodes in regex way #412

Open
EkremBayar opened this issue Jun 26, 2024 · 0 comments

There are multiple ways to select elements: XPath, CSS selectors, and regular expressions.

To reach some elements more easily, I've written a function that is used like the dplyr selection helpers. It combines the features of three of them: starts_with(), contains() and ends_with().

Previously I didn't know how to use regular expressions in web scraping and had no idea about selectors. I've since learned them to some extent, and I can now reach the elements without the function I wrote. However, beginners like me still have to research and learn how to reach the elements.
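For reference, here is a minimal sketch of reaching elements without the helper, using plain CSS attribute selectors. The inline HTML and its ids are made up for illustration, and the sketch assumes rvest 1.0.0 or later for `html_elements()` (on older versions `html_nodes()` should behave the same way). Note that these operators do substring matching on attribute values rather than full regular expressions.

```r
library(rvest)

# Made-up HTML, used only for illustration
html <- read_html('
  <div id="all_stats_squads_standard"></div>
  <div id="all_stats_squads_keeper"></div>
  <table id="stats_squads_standard_for"></table>
  <table id="stats_squads_keeper_for"></table>
')

# [attr^="x"] matches attribute values that start with x
by_prefix <- html_elements(html, '[id^="all_stats_squads_"]')

# [attr*="x"] matches attribute values that contain x
by_substring <- html_elements(html, '[id*="squads_standard"]')

# [attr$="x"] matches attribute values that end with x
by_suffix <- html_elements(html, '[id$="_for"]')

# Inspect which ids each selector picked up
html_attr(by_prefix, "id")
html_attr(by_substring, "id")
html_attr(by_suffix, "id")
```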

I'd like to hear your opinions: does it make sense to add a function like this to the rvest package as a new feature, so that elements are easier to reach?

# Packages
library(rvest)
library(dplyr)

# Function
html_nodes_regex <- function(html, node_name, attr, regex_type = c("equal", "startswith", "contains", "endswith")){

  #https://developer.mozilla.org/en-US/docs/Web/CSS/Pseudo-classes
  #https://medium.com/yonder-techblog/css-regex-attribute-selectors-98075b7f4726
  
  # Checks
  if(missing(node_name)){stop("`node_name` cannot be missing!")}
  if(missing(attr)){stop("`attr` cannot be missing!")}
  if(missing(regex_type)){stop("`regex_type` cannot be missing!")}
  if(!is.character(node_name)){stop("The class of `node_name` has to be character!")}
  if(!is.character(attr)){stop("The class of `attr` has to be character!")}
  if(!is.character(regex_type)){stop("The class of `regex_type` has to be character!")}
  # `regex_type` must be a single string and one of the supported options
  if(length(regex_type) != 1 || !regex_type %in% c("equal", "startswith", "contains", "endswith")){
    stop("`regex_type` has to be one of them: `equal`, `startswith`, `contains` or `endswith`!")
  }

  # Regex Type
  regex_type_check <- switch(regex_type,
                       equal = "",
                       startswith = "^",
                       contains = "*",
                       endswith = "$",
                       stop("Unknown `regex_type`! Type must be `equal`, `startswith`, `contains` or `endswith`", call. = FALSE)
  )

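  # Selector Query
  # The value is quoted so that values containing spaces or starting with a digit still give a valid selector
  query <- paste0("[", attr, regex_type_check, "=\"", node_name, "\"]")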

  # Selecting Elements
  html %>% rvest::html_nodes(query)

}

# Reading the HTML page of the Premier League
url <- "https://fbref.com/en/comps/9/Premier-League-Stats"
page <- rvest::read_html(url)
# Starts with
page %>% html_nodes_regex(node_name = "all_stats_squads_", attr = "id", regex_type = "startswith")
{xml_nodeset (11)}
 [1] <div id="all_stats_squads_standard" class="table_wrapper tabbed">\n\t\n\t\t<span class="section_anchor" id="stats ...
 [2] <div id="all_stats_squads_keeper" class="table_wrapper tabbed">\n\t\n\t\t<span class="section_anchor" id="stats_s ...
 [3] <div id="all_stats_squads_keeper_adv" class="table_wrapper tabbed">\n\t\n\t\t<span class="section_anchor" id="sta ...
 [4] <div id="all_stats_squads_shooting" class="table_wrapper tabbed">\n\t\n\t\t<span class="section_anchor" id="stats ...
 [5] <div id="all_stats_squads_passing" class="table_wrapper tabbed">\n\t\n\t\t<span class="section_anchor" id="stats_ ...
 [6] <div id="all_stats_squads_passing_types" class="table_wrapper tabbed">\n\t\n\t\t<span class="section_anchor" id=" ...
 [7] <div id="all_stats_squads_gca" class="table_wrapper tabbed">\n\t\n\t\t<span class="section_anchor" id="stats_squa ...
 [8] <div id="all_stats_squads_defense" class="table_wrapper tabbed">\n\t\n\t\t<span class="section_anchor" id="stats_ ...
 [9] <div id="all_stats_squads_possession" class="table_wrapper tabbed">\n\t\n\t\t<span class="section_anchor" id="sta ...
[10] <div id="all_stats_squads_playing_time" class="table_wrapper tabbed">\n\t\n\t\t<span class="section_anchor" id="s ...
[11] <div id="all_stats_squads_misc" class="table_wrapper tabbed">\n\t\n\t\t<span class="section_anchor" id="stats_squ ...
# Contains
page %>% html_nodes_regex(node_name = "squads_standar", attr = "id", regex_type = "contains")
{xml_nodeset (13)}
 [1] <div id="all_stats_squads_standard" class="table_wrapper tabbed">\n\t\n\t\t<span class="section_anchor" id="stats ...
 [2] <span class="section_anchor" id="stats_squads_standard_link" data-label="Squad Standard Stats"></span>
 [3] <div class="section_heading assoc_stats_squads_standard_for" id="stats_squads_standard_for_sh">\n  <span class="s ...
 [4] <span class="section_anchor" id="stats_squads_standard_for_link" data-label="Squad Standard Stats" data-no-inpage ...
 [5] <div class="section_heading hidden assoc_stats_squads_standard_against" id="stats_squads_standard_against_sh">\n  ...
 [6] <span class="section_anchor" id="stats_squads_standard_against_link" data-label="Squad Standard Stats" data-no-in ...
 [7] <div id="switcher_stats_squads_standard">\n\n\t<div class="table_container tabbed current" id="div_stats_squads_s ...
 [8] <div class="table_container tabbed current" id="div_stats_squads_standard_for">\n\t\t\n\t\t<table class="stats_ta ...
 [9] <table class="stats_table sortable min_width" id="stats_squads_standard_for" data-cols-to-freeze=",1">\n<caption> ...
[10] <div class="footer no_hide_long" id="tfooter_stats_squads_standard_for">\n\t\t\n\t\t<small>Totals may not be comp ...
[11] <div class="table_container tabbed" id="div_stats_squads_standard_against">\n\t\t\n\t\t<table class="stats_table  ...
[12] <table class="stats_table sortable min_width" id="stats_squads_standard_against" data-cols-to-freeze=",1">\n<capt ...
[13] <div class="footer no_hide_long" id="tfooter_stats_squads_standard_against">\n\t\t\n\t\t<small>Totals may not be  ...
# Ends with
page %>% html_nodes_regex(node_name = "_for", attr = "id", regex_type = "endswith")
{xml_nodeset (33)}
 [1] <div class="table_container tabbed current" id="div_stats_squads_standard_for">\n\t\t\n\t\t<table class="stats_ta ...
 [2] <table class="stats_table sortable min_width" id="stats_squads_standard_for" data-cols-to-freeze=",1">\n<caption> ...
 [3] <div class="footer no_hide_long" id="tfooter_stats_squads_standard_for">\n\t\t\n\t\t<small>Totals may not be comp ...
 [4] <div class="table_container tabbed current" id="div_stats_squads_keeper_for">\n\t\t\n\t\t<table class="stats_tabl ...
 [5] <table class="stats_table sortable min_width" id="stats_squads_keeper_for" data-cols-to-freeze=",1">\n<caption>Sq ...
 [6] <div class="footer no_hide_long" id="tfooter_stats_squads_keeper_for">\n\t\t\n\t\t<small>Totals may not be comple ...
 [7] <div class="table_container tabbed current" id="div_stats_squads_keeper_adv_for">\n\t\t\n\t\t<table class="stats_ ...
 [8] <table class="stats_table sortable min_width" id="stats_squads_keeper_adv_for" data-cols-to-freeze=",1">\n<captio ...
 [9] <div class="footer no_hide_long" id="tfooter_stats_squads_keeper_adv_for">\n\t\t\n\t\t<small>Totals may not be co ...
[10] <div class="table_container tabbed current" id="div_stats_squads_shooting_for">\n\t\t\n\t\t<table class="stats_ta ...
[11] <table class="stats_table sortable min_width" id="stats_squads_shooting_for" data-cols-to-freeze=",1">\n<caption> ...
[12] <div class="footer no_hide_long" id="tfooter_stats_squads_shooting_for">\n\t\t\n\t\t<small>Totals may not be comp ...
[13] <div class="table_container tabbed current" id="div_stats_squads_passing_for">\n\t\t\n\t\t<table class="stats_tab ...
[14] <table class="stats_table sortable min_width" id="stats_squads_passing_for" data-cols-to-freeze=",1">\n<caption>S ...
[15] <div class="footer no_hide_long" id="tfooter_stats_squads_passing_for">\n\t\t\n\t\t<small>Totals may not be compl ...
[16] <div class="table_container tabbed current" id="div_stats_squads_passing_types_for">\n\t\t\n\t\t<table class="sta ...
[17] <table class="stats_table sortable min_width" id="stats_squads_passing_types_for" data-cols-to-freeze=",1">\n<cap ...
[18] <div class="footer no_hide_long" id="tfooter_stats_squads_passing_types_for">\n\t\t\n\t\t<small>Totals may not be ...
[19] <div class="table_container tabbed current" id="div_stats_squads_gca_for">\n\t\t\n\t\t<table class="stats_table s ...
[20] <table class="stats_table sortable min_width" id="stats_squads_gca_for" data-cols-to-freeze=",1">\n<caption>Squad ...
...
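The CSS operators used above only cover prefix, substring and suffix matching. If a real regular expression is needed, one possible approach is to select every element that carries the attribute and then filter on its value with `grepl()`. The sketch below is only an illustration under stated assumptions: `html_elements_matching()` is a hypothetical helper name, not an rvest function, the inline HTML is made up, and rvest 1.0.0 or later is assumed for `html_elements()`.

```r
library(rvest)

# Hypothetical helper, not part of rvest: keep the elements whose
# attribute value matches a regular expression
html_elements_matching <- function(html, attr, pattern) {
  # Select every element that has the attribute at all
  candidates <- html_elements(html, paste0("[", attr, "]"))
  # html_attr() returns one value per selected element
  values <- html_attr(candidates, attr)
  # Keep only the elements whose value matches the pattern
  candidates[grepl(pattern, values)]
}

# Made-up HTML, used only for illustration
html <- read_html('
  <div id="all_stats_squads_standard"></div>
  <div id="all_stats_squads_keeper_adv"></div>
  <table id="stats_squads_standard_for"></table>
  <table id="stats_squads_standard_against"></table>
')

# Ids that start with "stats_squads_" and end with "_for" or "_against"
hits <- html_elements_matching(html, "id", "^stats_squads_.*_(for|against)$")
html_attr(hits, "id")
```

Filtering in R this way is slower than a pure selector on large pages, since every element with the attribute is collected first, but it allows patterns such as alternation that attribute selectors cannot express.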

Best regards,
Ekrem.

@EkremBayar EkremBayar changed the title Regex way finding elements/nodes Finding elements/nodes in regex way Jun 26, 2024