Web scraping the {polite} way

[Image: Martin Freeman as Douglas Adams's Arthur Dent, taking a sip of tea and saying 'oh, come on, that's lovely']

Reaping with rvest

Ah, salutations, and welcome to this blog post about polite web scraping. Please do come in. I’ll take your coat. How are you? Would you like a cup of tea? Oh, I insist!

Speaking of tea, perhaps you’d care to join me in genial conversation about it. Where to begin? Let’s draw inspiration from popular posts on Reddit’s r/tea subreddit. I’ll fetch the post titles using Hadley Wickham’s {rvest} package, having found the correct CSS selector with SelectorGadget by Andrew Cantino and Kyle Maxwell.

# Load some packages we need
library(rvest)  # scrape a site
library(dplyr)  # data manipulation

# CSS for post titles found using SelectorGadget
# (This is a bit of an odd one)
css_select <- "._3wqmjmv3tb_k-PROt7qFZe ._eYtD2XCVieq6emjKBH3m"

# Scrape a specific named page
tea_scrape <- read_html("https://www.reddit.com/r/tea") %>%  # read the page
  html_nodes(css = css_select) %>%  # read post titles
  html_text()  # convert to text

print(tea_scrape)
## [1] "What's in your cup? Daily discussion, questions and stories - September 08, 2019"                                                                 
## [2] "Marketing Monday! - September 02, 2019"                                                                                                           
## [3] "Uncle Iroh asking the big questions."                                                                                                             
## [4] "The officially licensed browser game of Game of Thrones has launched! Millions of fans have put themselves into the battlefield! What about you?" 
## [5] "They mocked me. They said that I was a fool for drinking leaf water."                                                                             
## [6] "100 years old tea bush on my estate in Uganda."                                                                                                   
## [7] "Cold brew colors"                                                                                                                                 
## [8] "Finally completed the interior of my tea house only needed a fire minor touches not now it’s perfect, so excited to have this as a daily tea spot"

That’ll provide us with some conversational fodder, wot wot.

It costs nothing to be polite

Mercy! I failed to doff my cap adequately before entering the website! They must take me for some sort of street urchin.

Forgive me. Perhaps you’ll allow me to show you a more respectful method via the in-development {polite} package from the esteemed gentleman Dmytro Perepolkin? An excellent way ‘to promote responsible web etiquette’.

A reverential bow()

Perhaps the website owners don’t want people to keep barging in willy-nilly without so much as an ‘ahoy-hoy’.

We should identify ourselves and our intent with a humble bow(). We can expect a curt but informative response from the site—via its robots.txt file—that tells us where we can visit and how frequently.

# remotes::install_github("dmi3kno/polite")  # to install
library(polite)  # respectful webscraping

# Make our intentions known to the website
reddit_bow <- bow(
  url = "https://www.reddit.com/",  # base URL
  user_agent = "M Dray <https://rostrum.blog>",  # identify ourselves
  force = TRUE
)

print(reddit_bow)
## <polite session> https://www.reddit.com/
##      User-agent: M Dray <https://rostrum.blog>
##      robots.txt: 32 rules are defined for 4 bots
##     Crawl delay: 5 sec
##   The path is scrapable for this user-agent

Super-duper. The (literal) bottom line is that we’re allowed to scrape. The website does have 32 rules to stop unruly behaviour though, and even calls out four very naughty bots that are obviously not very polite. We’re invited to leave a five-second delay between requests to allow for maximum respect.
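
Should you care to peruse the full rulebook yourself, the {robotstxt} package by Peter Meissner (which {polite} leans on, as you’ll spy in the session info below) can fetch and parse a robots.txt file directly. A minimal sketch; the results will reflect whatever rules the site happens to serve when you run it:

library(robotstxt)  # fetch and parse robots.txt files

# Read Reddit's robots.txt so we can eyeball the rules ourselves
reddit_rules <- robotstxt(domain = "reddit.com")

reddit_rules$crawl_delay  # any Crawl-delay directives
paths_allowed("/r/tea", domain = "reddit.com")  # is this path scrapable?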

Give a nod()

Ahem, conversation appears to be wearing a little thin; perhaps I can interest you by widening the remit of our chitchat? Rather than merely iterating through subpages of the same subreddit, we can visit the front pages of a few different subreddits. Let’s celebrate the small failures and triumphs of being British, a classic topic of polite conversation in Britain.

We’ve already given a bow() and made our intentions clear; a knowing nod() will be sufficient for the next steps. Here’s a little function to nod() to the site each time we iterate over a vector of subreddit names. Our gentlemanly agreement remains intact from our earlier bow().

library(purrr)  # functional programming tools
library(tidyr)  # tidy-up data structure

get_posts <- function(subreddit_name, bow = reddit_bow, css_select){
  
  # 1. Agree modification of session path with host
  session <- nod(
    bow = bow,
    path = paste0("r/", subreddit_name)
  )
  
  # 2. Scrape the page from the new path
  scraped_page <- scrape(session)
  
  # 3. Extract the nodes matching the CSS selector
  node_result <- html_nodes(
    scraped_page,
    css = css_select
  )
  
  # 4. Render result as text
  text_result <- html_text(node_result)
  
  # 5. Return the text value
  return(text_result)
  
}

Smashing. Care to join me in applying this function over a vector of subreddit names? Tally ho.

# A vector of subreddits to iterate over
subreddits <- set_names(
  c("BritishProblems", "BritishSuccess")
)

# Get top posts for named subreddits
top_posts <- map_df(
  subreddits,
  ~get_posts(.x, css_select = "._3wqmjmv3tb_k-PROt7qFZe ._eYtD2XCVieq6emjKBH3m")
) %>% 
  gather(
    BritishProblems, BritishSuccess,
    key = subreddit, value = post_text
  )

knitr::kable(top_posts)
| subreddit | post_text |
|:--|:--|
| BritishProblems | Being part of the apparently small minority which is not that interested in Strictly Come Dancing |
| BritishProblems | Not being able to eat a Jaffa Cake without going “full moon, half moon, total eclipse” |
| BritishProblems | Was woken by my teenage son 3 nights in a row to “get an enormous spider” in his room. Told him to go to bed. Last night he woke me to say he’d cornered it. I got up to help and it was the size of a fucking Boeing 747. |
| BritishProblems | Went to Brighton for 5 hours yesterday, £32 for parking! Must tell everyone on earth! |
| BritishProblems | Going on holiday in the UK in September and having to pack for sun, snow and a tropical rainstorm, just in case. |
| BritishProblems | Is there anything that brings us closer together than seeing the Mince Pies that have already arrived in Supermarkets and exclaiming to your fellow shoppers that it’s far too early for this sort of carry on. |
| BritishProblems | Walking down the road and getting hit full in the face by a cotton candy flavour clowd from the twat in a puffer jacket in front of me |
| BritishProblems | Having to push the ‘door open’ button on the train before it lights up to reassure other passengers that we’ve got control of the situation. |
| BritishSuccess | [META] New rule: No Politics allowed |
| BritishSuccess | [META] UK Discord Server |
| BritishSuccess | We were out of the normal Aldi milk so I put some condensed milk in my tea instead and it changed my life. AMA. |
| BritishSuccess | The Aldi checkout lady complemented the organisational structure of my shopping this morning. Finally some recognition for the effort I put in! |
| BritishSuccess | Was stood waiting for the bus for 20, only one there. A few minutes before the bus arrives, there’s a huge line waiting for the bus. Woman tried to cut in as the driver was letting us on & proceeded to tell her to go to the back of the line for cutting in. |
| BritishSuccess | Went to the Pusheen Cafe in Brighton yesterday, wasn’t overpriced crap! |
| BritishSuccess | Introduced myself to the upstairs neighbour whose hardwood floors make him sound like King Kong and he apologised profusely for the noise. Still riding the adrenaline train (in the quiet carriage, of course). |
| BritishSuccess | I ate 3 Greggs (vegan) sausage rolls yesterday. |

Bravo, what excellent manners we’ve demonstrated. You can also iterate over different query strings – for example if your target website displays information over several subpages – with the params argument of the scrape() function.
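
Here’s a minimal sketch of that idea, reusing reddit_bow and css_select from earlier. The ‘sort’ parameter is purely illustrative (the site may or may not honour it), and params takes a query string in {polite} 0.0.0.9005; later versions renamed the argument to query, which takes a list, so do consult ?scrape for your installed version.

# Nod to the subreddit, then scrape it with a query string appended
# (hypothetical parameter, for illustration only)
tea_top <- nod(reddit_bow, path = "r/tea") %>%
  scrape(params = "sort=top") %>%  # ask for e.g. r/tea?sort=top
  html_nodes(css = css_select) %>%
  html_text()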

Oh, you have to leave? No, no, you haven’t overstayed your welcome! It was truly marvellous to see you. Don’t forget your brolly, old chap, and do remember to print the session info for this post. Pip pip!


Session info

## [1] "Last updated 2019-09-08"
## R version 3.5.2 (2018-12-20)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
## 
## Locale: en_GB.UTF-8 / en_GB.UTF-8 / en_GB.UTF-8 / C / en_GB.UTF-8 / en_GB.UTF-8
## 
## Package version:
##   askpass_1.1        assertthat_0.2.1   backports_1.1.4   
##   base64enc_0.1.3    BH_1.69.0.1        blogdown_0.11     
##   bookdown_0.9       cli_1.1.0          codetools_0.2.16  
##   compiler_3.5.2     crayon_1.3.4       curl_3.3          
##   digest_0.6.19      dplyr_0.8.1        evaluate_0.14     
##   fansi_0.4.0        future_1.13.0      future.apply_1.2.0
##   globals_0.12.4     glue_1.3.1         graphics_3.5.2    
##   grDevices_3.5.2    here_0.1           highr_0.8         
##   htmltools_0.3.6    httpuv_1.5.1       httr_1.4.0        
##   jsonlite_1.6       knitr_1.23         later_0.8.0       
##   listenv_0.7.0      magrittr_1.5       markdown_1.0      
##   memoise_1.1.0      methods_3.5.2      mime_0.7          
##   openssl_1.4        parallel_3.5.2     pillar_1.4.1      
##   pkgconfig_2.0.2    plogr_0.2.0        polite_0.0.0.9005 
##   promises_1.0.1     purrr_0.3.2        R6_2.4.0          
##   ratelimitr_0.4.1   Rcpp_1.0.2         rlang_0.4.0       
##   rmarkdown_1.13     robotstxt_0.6.2    rprojroot_1.3-2   
##   rvest_0.3.4        selectr_0.4-1      servr_0.13        
##   spiderbar_0.2.1    stats_3.5.2        stringi_1.4.3     
##   stringr_1.4.0      sys_3.2            tibble_2.1.3      
##   tidyr_0.8.3        tidyselect_0.2.5   tinytex_0.13      
##   tools_3.5.2        triebeard_0.3.0    urltools_1.7.2    
##   utf8_1.1.4         utils_3.5.2        vctrs_0.1.0       
##   xfun_0.7           xml2_1.2.0         yaml_2.2.0        
##   zeallot_0.1.0