When I analyze vegetation data, I often want to split plants into different growth forms, e.g. trees and shrubs. However, I do not have a list of them to match against in R. I searched online and did not find any list ready for analysis. The only thing I found is the University of Wisconsin - Stevens Point herbarium website. They have identification guides. But when I click on, say, the Trees link, there is no txt file ready to download, and a txt table is all I want!
As a result, I decided to create the txt table from their website, with the help of Sublime Text. Here is the procedure:
- Open the website, right click, choose `view page source`.
- Copy the page source into Sublime Text, then we can see the species names are wrapped by `SpCode=` and `</A>`.
- In Sublime Text, `Ctrl+I`, turn on regular expression search (`Alt+R`), type `SpCode=.+</A`, then press `Alt+Enter`. All matches were selected! Cool! Then copy (`Ctrl+C`) all of them into a new file (`Ctrl+N`) in Sublime Text!
- Not done yet. In the new file, `Ctrl+I`, then type `SpCo.+>` first, then press `Alt+Enter`, then press `Delete`. We are halfway there. Next we search `</A`, `Alt+Enter`, then press `Delete`. Awesome! We got the list of that page!
- Repeat this for each page… Fortunately, there are not too many for trees and shrubs! Or copy and paste all the page sources into one file and then follow steps 2-4, for each growth habit.
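The same search-and-delete steps can be sketched in base R. This is only an illustration: the lines in `page` are invented stand-ins for the page source (the real markup may differ), and the variable names are my own.

```r
# Hypothetical lines of page source; the real markup on the site may differ.
page <- c(
  '<TD><A HREF="detail.asp?SpCode=ACERUB">Acer rubrum</A></TD>',
  '<TD>some other cell</TD>',
  '<TD><A HREF="detail.asp?SpCode=SALXRU">Salix X rubens</A></TD>'
)

# Step 3: select everything from "SpCode=" up to "</A" on each line.
hits <- unlist(regmatches(page, gregexpr("SpCode=.+</A", page)))

# Step 4: delete the leading "SpCo...>" part, then the trailing "</A".
species <- sub("^SpCo.+>", "", hits)
species <- sub("</A$", "", species)

print(species)
```

Lines without a match simply drop out, which mirrors what happens when only the selected matches are copied into a new file.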
I put the lists of trees, shrubs, vines, ferns/fern allies, and graminoids online. Download them if you need them.
For each species, if you just want the genus and species, without the subsp./var. info, you can read the names into R and use a regular expression to extract them. Or you can do this in Sublime Text. Here is how I do it in R:
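library(stringr)
# Read one name per line (sep = "\n" keeps each line whole)
tree <- read.table("data/tree.txt", sep = "\n", stringsAsFactors = FALSE, quote = "")$V1
# Keep only "Genus species"; the optional X allows names written as "Genus X species"
tree <- unlist(str_extract_all(tree, "^[A-Za-z]+\\s{1,1}X?\\s?[a-z]+"))
# Drop duplicates left after removing the subsp./var. info
tree <- unique(tree)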
In the regular expression, `^` means the beginning of the string; `[A-Za-z]` means any letter; `+` means one or more times; `\\s` means any whitespace character; `{1,1}` means exactly one time; `?` means zero or one time. If you do this in Sublime Text, replace `\\s` with `\s`.
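To see what the pattern keeps, here is a small check on a few names chosen for illustration (they are not read from the downloaded lists). It uses base R's `regexpr()` and `regmatches()` instead of `stringr`, only so that it runs without extra packages.

```r
# Example names chosen for illustration, not taken from the txt files.
names_raw <- c(
  "Acer rubrum var. drummondii",
  "Salix X rubens",
  "Acer rubrum",
  "Quercus alba"
)

# Same pattern as above: genus, one space, optional "X " marker, lowercase epithet.
pattern <- "^[A-Za-z]+\\s{1,1}X?\\s?[a-z]+"

# regexpr() finds the first match in each string; regmatches() pulls it out.
binomials <- unique(regmatches(names_raw, regexpr(pattern, names_raw)))

print(binomials)
# [1] "Acer rubrum"    "Salix X rubens" "Quercus alba"
```

The variety is collapsed into its species, so `unique()` leaves three names from the four inputs.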
I hope this will be helpful if you are doing community ecology analyses.
You can also do all of this in R easily. I did not write R code to extract the information at the beginning, since there are only several pages and I could do them by hand. If you look at the page source, you will find that the pages' web addresses differ only in a number. As a result, you can build the web addresses for all of these pages, read them into R using `readLines()`, and then extract the names you want with regular expressions.
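A minimal sketch of that idea follows. The function name `extract_species` is my own, and the address pattern in the commented usage is a placeholder, not the real site address; the pattern assumes the names are still wrapped by `SpCode=` and `</A>` as described above.

```r
# Read several pages and pull out the species names.
# `urls` can be web addresses or local file paths; readLines() accepts both.
extract_species <- function(urls) {
  species <- character(0)
  for (u in urls) {
    src <- readLines(u, warn = FALSE)
    # Grab "SpCode=...</A" on each line, then strip both wrappers.
    hits <- unlist(regmatches(src, gregexpr("SpCode=.+</A", src)))
    hits <- sub("^SpCo.+>", "", hits)
    species <- c(species, sub("</A$", "", hits))
  }
  unique(species)
}

# Hypothetical usage: only the page number changes between addresses.
# base_url <- "http://example.org/guide.asp?page="  # placeholder, not the real site
# trees <- extract_species(paste0(base_url, 1:5))
# writeLines(trees, "data/tree.txt")
```

Saved copies of the page source work too, since `readLines()` treats a file path the same way as a web address.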