Typo-shaming my Git commits

A line-drawn monkey poking a typewriter.

The author at work (CC BY-SA 3.0 KaterBegemot)

tl;dr

About 9 per cent of the commits to this blog’s source involve typo fixes, according to a function I wrote to search commit messages via the {gh} package.

Not my typo

I’m sure you’ve seen consecutive Git commits from jaded developers like ‘fix problem’, ‘actually fix problem?’, ‘the fix broke something else’, ‘burn it all down’. Sometimes a few swear words will be thrown in for good measure (look no further than ‘Developers Swearing’ on Twitter).

The more obvious problem from reading the commits for this blog is my incessant keyboard mashing; I think a lot of my commits are there to fix typos.[1]

So I’ve prepared a little R function to grab the commit messages for a specified repo and find the ones that contain a given search term, like ‘typo’.[2]

Search commits

{gh} is a handy R package from Gábor Csárdi, Jenny Bryan and Hadley Wickham that we can use to interact with GitHub’s REST API.[3] We can also use {purrr} for iterating over the returned API object.

library(gh)    # CRAN v1.2.0
library(purrr) # CRAN v0.3.4

So, here’s one way of forming a function to search commit messages:

search_commits <- function(owner, repo, string = "typo") {

  # Fetch every commit for the repo; .limit = Inf follows pagination to the end
  commits <- gh::gh(
    "GET /repos/{owner}/{repo}/commits",
    owner = owner, repo = repo,
    .limit = Inf
  )

  # Extract the message from each commit as a character vector
  messages <- purrr::map_chr(
    commits, ~purrr::pluck(.x, "commit", "message")
  )

  # Keep messages containing the search string (case-insensitive regex)
  matches <- messages[grepl(string, messages, ignore.case = TRUE)]

  # Bundle the inputs, summary counts and messages
  out <- list(
    meta = list(owner, repo),
    counts = list(
      match_count = length(matches),
      commit_count = length(messages),
      match_ratio = length(matches) / length(messages)
    ),
    matches = matches,
    messages = messages
  )

  return(out)

}

First we pass a GET request to the GitHub API via gh::gh(). The API documentation tells us the form needed to get commits for a given owner’s repo.

Beware: the API returns results in pages (at most 100 items per request), but the .limit = Inf argument makes gh::gh() keep requesting pages until everything is returned. For a repo with a long history, that means a lot of API calls.
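If you’d rather not fetch an entire history, one option is to pass a finite .limit instead. This is a sketch only: get_commit_messages() and its limit argument are hypothetical names, not part of {gh} or the function above, and it uses base R’s vapply() for the extraction.

```r
# Hypothetical variant: return at most `limit` commit messages
get_commit_messages <- function(owner, repo, limit = 100) {

  commits <- gh::gh(
    "GET /repos/{owner}/{repo}/commits",
    owner = owner, repo = repo,
    .limit = limit  # stop paginating once this many records are collected
  )

  # One message string per commit
  vapply(commits, function(x) x[["commit"]][["message"]], character(1))

}

# Usage (makes a real API request, so it's left commented out):
# some_messages <- get_commit_messages("matt-dray", "rostrum-blog", limit = 50)
```

Nothing is requested until the function is called, so defining it is free.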

Next we can use {purrr} to iteratively pluck() out the commit messages from the list returned by gh::gh(). It’s then a case of finding which ones contain a search string of interest (defaulting to the word ‘typo’).
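To see that pluck-and-filter step without touching the API, here’s a minimal sketch on a made-up list shaped like the response. The fake_commits object and its contents are invented for illustration (a real response has many more fields), and I’ve swapped in base R’s vapply() for purrr::map_chr() so it runs with no packages.

```r
# Made-up stand-in for the list returned by gh::gh(): one element per
# commit, each with a nested `commit` list that holds the `message`
fake_commits <- list(
  list(sha = "a1", commit = list(message = "Fix typo in intro")),
  list(sha = "b2", commit = list(message = "Add new post")),
  list(sha = "c3", commit = list(message = "Typos, again"))
)

# Pull commit$message out of every element
fake_messages <- vapply(
  fake_commits,
  function(x) x[["commit"]][["message"]],
  character(1)
)

# Keep the messages that contain the search string, ignoring case
fake_matches <- fake_messages[grepl("typo", fake_messages, ignore.case = TRUE)]

fake_matches
## [1] "Fix typo in intro" "Typos, again"
```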

The object returned by search_commits() is a list with four elements: meta repeats the owner and repo names; counts is a list with the count of messages containing the search term, the total commit count, and their ratio; matches holds the messages containing the search term; and messages holds all of them.

Fniding my typoes

Here’s an example where I look for commit messages to this blog that contain the word ‘typo’. Since the function contains the .limit = Inf argument in gh::gh(), we’ll get an output message for each separate request that’s been made to the API.

blog_typos <- search_commits("matt-dray", "rostrum-blog")
## ℹ Running gh query
## ℹ Running gh query, got 100 records of about 700
## ℹ Running gh query, got 300 records of about 1050
## ℹ Running gh query, got 600 records of about 1400
## ℹ Running gh query, got 1000 records of about 1750
## ℹ Running gh query, got 1500 records of about 2100
## ℹ Running gh query, got 2100 records of about 2450

Here’s a preview of the structure of the returned object. You can see how it’s a list that contains the values and other list elements that we expected.

str(blog_typos)
## List of 4
##  $ meta    :List of 2
##   ..$ : chr "matt-dray"
##   ..$ : chr "rostrum-blog"
##  $ counts  :List of 3
##   ..$ match_count : int 59
##   ..$ commit_count: int 691
##   ..$ match_ratio : num 0.0854
##  $ matches : chr [1:59] "Small text adjustments to skyphone and typos posts" "Fix missing words in typo post" "Publish typo post" "Correct typos in sonify post" ...
##  $ messages: chr [1:691] "Add assignment post" "Improve copy, change date of xml post" "Correct app URLs in randoflag post" "Accidentally a word in randoflag post" ...

You can see there were 691 commit messages returned, of which 59 contained the string ‘typo’. That’s 9 per cent.

Here’s a sample of those commit messages that contained the word ‘typo’:

set.seed(1337)
sample(blog_typos$matches, 5)
## [1] "Update sentences and typos in r2eng post"       
## [2] "Add tl;dr, typos"                               
## [3] "fix typos"                                      
## [4] "fix typo"                                       
## [5] "Add hist, stats; explain code better; fix typos"

It seems the typos are often corrected with general improvements to a post’s copy. This usually happens when I read the post the next day with fresh eyes and groan at my ineptitude.[4]

Exposing others

I suspect typos are referenced most often in repos that hold a lot of prose, such as documentation or a book.

To make myself feel better, I had a quick look at the repo for the {bookdown} project R for Data Science by Hadley Wickham and Garrett Grolemund.

typos_r4ds <- search_commits("hadley", "r4ds")

The result:

str(typos_r4ds)
## List of 4
##  $ meta    :List of 2
##   ..$ : chr "hadley"
##   ..$ : chr "r4ds"
##  $ counts  :List of 3
##   ..$ match_count : int 290
##   ..$ commit_count: int 1328
##   ..$ match_ratio : num 0.218
##  $ matches : chr [1:290] "Typo fix for model-basics.Rmd (#910)\n\nCorrected \"the\" to \"then\" on line 108" "Two typo fixes for model basics chapter (#908)\n\n* Remove typo\r\n\r\n* Make model function naming convention consistent" "fix typo (#899)" "Potential typo? (#897)\n\nI don't know if it was meant to be this way, because it's actually not weird to say `"| __truncated__ ...
##  $ messages: chr [1:1328] "Merge pull request #924 from mine-cetinkaya-rundel/no-iris\n\nStructural updates for 2e" "Second crack and 2e structure" "Move up tidy data chapter" "Add feather to imports to see if it helps w/ build" ...

Surprise: typos happen to all of us. I’m guessing the percentage is quite high (about 22 per cent) because the book has a lot of readers scouring it, finding small issues and providing quick fixes.
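To put the two repos side by side, here’s a rough sketch that tabulates the counts from objects shaped like the output of search_commits(). The summarise_counts() helper is hypothetical, and the two list objects are rebuilt by hand from the numbers printed above rather than fetched again.

```r
# Hand-built stand-ins shaped like search_commits() output,
# using the counts printed in this post
blog_counts <- list(
  meta = list("matt-dray", "rostrum-blog"),
  counts = list(match_count = 59L, commit_count = 691L)
)
r4ds_counts <- list(
  meta = list("hadley", "r4ds"),
  counts = list(match_count = 290L, commit_count = 1328L)
)

# Hypothetical helper: one row per result, with the match percentage
summarise_counts <- function(results) {
  matches <- vapply(results, function(x) x$counts$match_count, integer(1))
  commits <- vapply(results, function(x) x$counts$commit_count, integer(1))
  data.frame(
    repo = vapply(results, function(x) x$meta[[2]], character(1)),
    matches = matches,
    commits = commits,
    percent = round(100 * matches / commits, 1)
  )
}

comparison <- summarise_counts(list(blog_counts, r4ds_counts))
comparison
```

With the real objects you could pass list(blog_typos, typos_r4ds) instead of the stand-ins.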

In other words

Of course, you can change the string argument of search_commits() to find terms other than the default ‘typo’. Use your imagination.
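Because string is handed straight to grepl(), it’s treated as a regular expression, so one search can cover several terms. Here’s a small sketch on invented messages (nothing here comes from a real repo), which also catches one way of fat-fingering the word ‘typo’ itself.

```r
# Invented commit messages, for illustration only
example_messages <- c(
  "Fix typo in intro",
  "Correct spelling of 'separate'",
  "Add session info",
  "fix tpyo"
)

# Alternation: match any one of several terms
pattern <- "typo|tpyo|spelling"

hits <- example_messages[grepl(pattern, example_messages, ignore.case = TRUE)]

hits
## [1] "Fix typo in intro"              "Correct spelling of 'separate'"
## [3] "fix tpyo"
```

The same pattern could be passed as the string argument of search_commits().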

Here’s a meta example: messages containing emoji in the commits to the {emo} package by Hadley Wickham, Romain François and Lucy D’Agostino McGowan.

Emoji are expressed in commit messages like :dog:, so we can capture them with a relatively simple regular expression like ":.*:" (match wherever there are two colons with anything, or nothing, in between).

emo_emoji <- search_commits("hadley", "emo", ":.*:")
## ℹ Running gh query
## ℹ Running gh query, got 100 records of about 200
str(emo_emoji)
## List of 4
##  $ meta    :List of 2
##   ..$ : chr "hadley"
##   ..$ : chr "emo"
##  $ counts  :List of 3
##   ..$ match_count : int 21
##   ..$ commit_count: int 112
##   ..$ match_ratio : num 0.188
##  $ matches : chr [1:21] "need emo:: prefix in that case, bc ji_glue might be called without emo being attached. ping @batpigandme" "rm emoji keyboard (saved in separate branch) but eventually might just go in a separate :package:" "emo::ji_rx a meta regex to catch all emojis. closes #14" "bring in some extra modules (for emo::ji_rx)" ...
##  $ messages: chr [1:112] "Imports CRAN glue (#54)" "no longer importing dplyr. #24" "less dependency on dplyr" "clock no longer depends on dplyr" ...

Only 19 per cent? Son, I am disappoint.
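One caveat: ":.*:" also matches R’s :: operator, which is why messages like ‘bring in some extra modules (for emo::ji_rx)’ appear in the matches above. A tighter pattern is sketched below on three of those messages. The character class is my assumption about what a shortcode can contain (lowercase letters, digits, underscore, plus and hyphen), not an official definition.

```r
# Three messages copied from the matches printed above
msgs <- c(
  "rm emoji keyboard (saved in separate branch) but eventually might just go in a separate :package:",
  "emo::ji_rx a meta regex to catch all emojis. closes #14",
  "bring in some extra modules (for emo::ji_rx)"
)

# Original pattern: any two colons, even adjacent ones
loose <- grepl(":.*:", msgs)

# Stricter pattern: at least one shortcode-like character between the colons
strict <- grepl(":[a-z0-9_+-]+:", msgs)

loose
## [1] TRUE TRUE TRUE
strict
## [1]  TRUE FALSE FALSE
```

So the real emoji percentage is probably even lower than 19.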


Session info
## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 4.0.2 (2020-06-22)
##  os       macOS  10.16                
##  system   x86_64, darwin17.0          
##  ui       X11                         
##  language (EN)                        
##  collate  en_GB.UTF-8                 
##  ctype    en_GB.UTF-8                 
##  tz       Europe/London               
##  date     2021-03-14                  
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package     * version    date       lib source                            
##  assertthat    0.2.1      2019-03-21 [1] CRAN (R 4.0.0)                    
##  blogdown      0.21       2020-10-11 [1] CRAN (R 4.0.2)                    
##  bookdown      0.21       2020-10-13 [1] CRAN (R 4.0.2)                    
##  cli           2.3.1      2021-02-23 [1] CRAN (R 4.0.2)                    
##  curl          4.3        2019-12-02 [1] CRAN (R 4.0.0)                    
##  digest        0.6.27     2020-10-24 [1] CRAN (R 4.0.2)                    
##  evaluate      0.14       2019-05-28 [1] CRAN (R 4.0.0)                    
##  gh          * 1.2.0      2020-11-27 [1] CRAN (R 4.0.2)                    
##  gitcreds      0.1.1      2020-12-04 [1] CRAN (R 4.0.2)                    
##  glue          1.4.2      2020-08-27 [1] CRAN (R 4.0.2)                    
##  htmltools     0.5.1.9000 2021-03-11 [1] Github (rstudio/htmltools@ac43afe)
##  httr          1.4.2      2020-07-20 [1] CRAN (R 4.0.2)                    
##  jsonlite      1.7.2      2020-12-09 [1] CRAN (R 4.0.2)                    
##  knitr         1.31       2021-01-27 [1] CRAN (R 4.0.2)                    
##  magrittr      2.0.1      2020-11-17 [1] CRAN (R 4.0.2)                    
##  purrr       * 0.3.4      2020-04-17 [1] CRAN (R 4.0.0)                    
##  R6            2.5.0      2020-10-28 [1] CRAN (R 4.0.2)                    
##  rlang         0.4.10     2020-12-30 [1] CRAN (R 4.0.2)                    
##  rmarkdown     2.6        2020-12-14 [1] CRAN (R 4.0.2)                    
##  rstudioapi    0.13       2020-11-12 [1] CRAN (R 4.0.2)                    
##  sessioninfo   1.1.1      2018-11-05 [1] CRAN (R 4.0.0)                    
##  stringi       1.5.3      2020-09-09 [1] CRAN (R 4.0.2)                    
##  stringr       1.4.0      2019-02-10 [1] CRAN (R 4.0.0)                    
##  withr         2.4.1      2021-01-26 [1] CRAN (R 4.0.2)                    
##  xfun          0.21       2021-02-10 [1] CRAN (R 4.0.2)                    
##  yaml          2.2.1      2020-02-01 [1] CRAN (R 4.0.0)                    
## 
## [1] /Library/Frameworks/R.framework/Versions/4.0/Resources/library

  1. Yes, I’m aware of Git hooks and various GitHub Actions that could prevent this.

  2. Though obviously you’ll miss messages containing the word ‘typo’ if you have a typo in the word ‘typo’ in one of your commits…

  3. I used it most recently in my little {ghdump} package for downloading or cloning a user’s repos en masse.

  4. I wonder how many typos I’ll need to correct in this post after publishing. (Edit: turns out I accidentally missed a couple of words, lol.)