English Blog on Daijiang Li

Brief notes of the iDigBio workshop

Mon, 10 Jun 2024 00:00:00 +0000

Advances in Digital Media Workshop Series: Yale

Here are just some of my very brief notes (pretty much just keywords).

LightningBug:
- digitizing specimen labels using ML
- Meta’s segmentation tool, the Segment Anything Model (SAM), is good and faster than R-CNN
- 200k images, 6.9k specimens
Heritage Science
- NSF Mid-scale research program
MorphoSource: 3D, 2D, AV media data repository
- Maybe a good place to look for exemplary sites for PhenoBase
Audiovisual Core
Expanding LeafMachine2: new training data, models, and methods for processing herbarium specimens; Will Weaver, PhD Candidate, University of Michigan
Detectron by Facebook to detect objects in images
Imageomics ?
Phylogeny-guided neural network (phylo-NNs) Elhamod et al, KDD 2023
IIIF
Phenotypic diversity
- Phenological diversity
- Phenome space
- Segment Anything Model (SAM) + Grounding DINO
- t-SNE visualization for clustering data
More training data is not always better for ML models
- If additional related but not present images are added
Multimodal AI models
- CLIP
- LMMs as effective rerankers
- INQUIRE: text-to-image search of iNaturalist images
2D to 3D reconstruction
- Surface-to-volume ratio seems to be well preserved in sharks and snakes

Running R on HiperGator

Mon, 06 May 2024 00:00:00 +0000

The problem

How can I run R on HiperGator from my terminal? The interactive RStudio server works okay, but whenever you request a longer running time or more memory, you wait much longer in the queue. I would prefer to just run R CMD BATCH in my terminal.

Solution

It is probably documented somewhere by HiperGator. I just could not find it easily.

Here are the steps I followed.

Log in to the HiperGator terminal, then install miniforge and mamba
Exit the terminal and log back in
In the terminal, run mamba create -n nameofmyenvi r-essentials r-base
- add additional packages you want to install, e.g., r-torch
- or install later with mamba install cuda-toolkit=11.8 pytorch
mamba activate nameofmyenvi
In the terminal, type R and now you should be able to open R
- If running long jobs, use tmux: module load tmux, then tmux
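
For the R CMD BATCH route mentioned in the problem statement, the script mainly needs to write its results to disk, because no interactive session is attached. Here is a minimal sketch; the file names job.R and job_result.rds and the toy model are placeholders of mine, not part of the original setup.

```r
# job.R: run non-interactively with `R CMD BATCH job.R`
# (console output is written to job.Rout in the working directory)
set.seed(1)                               # make the toy example reproducible
dat <- data.frame(x = rnorm(100))
dat$y <- 2 * dat$x + rnorm(100)

fit <- lm(y ~ x, data = dat)              # stand-in for the real long-running work

out_file <- "job_result.rds"              # placeholder output path
saveRDS(coef(fit), out_file)              # save results so they outlive the session
```

Inside the activated mamba environment, this would be started with R CMD BATCH job.R, ideally from a tmux session so that it keeps running after you disconnect.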

Problems with installing R package `arrow`

Mon, 14 Aug 2023 00:00:00 +0000

The problem

On the Linux server, I recently upgraded R to version 4.3.1. Today, when I tried to use the arrow R package to read some large data files, I got the following error:

Error: package or namespace load failed for ‘arrow’ in dyn.load(file, DLLpath = DLLpath, ...):
 unable to load shared object '/home/dli/R/arrow/libs/arrow.so':
  libcrypto.so.1.1: cannot open shared object file: No such file or directory

It seems that the file libcrypto.so.1.1 is missing (not sure why as I did not change the OS in the past couple of months).

Solution

It seems that libcrypto.so.1.1 is included in the libssl1.1 package. I browsed the options at http://nz2.archive.ubuntu.com/ubuntu/pool/main/o/openssl/?C=M;O=D

wget http://nz2.archive.ubuntu.com/ubuntu/pool/main/o/openssl/libssl1.1_1.1.1f-1ubuntu2_amd64.deb
sudo dpkg -i libssl1.1_1.1.1f-1ubuntu2_amd64.deb

Use the above commands to install the missing package. Problem solved. Sigh… 🥲 🥲

Git used wrong path of `gh`

Mon, 26 Jun 2023 00:00:00 +0000

The problem

On the Linux server, I installed Homebrew to manage software and gh to manage GitHub authorization. It used to work well. Today, after not using it for a while, I tried to use git push to push commits to GitHub from there. However, it complained that it could not find the gh binary.

/home/linuxbrew/.linuxbrew/Cellar/gh/2.14.3/bin/gh auth git-credential get: 1: /home/linuxbrew/.linuxbrew/Cellar/gh/2.14.3/bin/gh: not found

This seems to be a version issue: brew has updated gh to a later version, yet git push is somehow still using the old path.

Solution

gh auth setup-git

Use the above command to set or update git to use GitHub CLI gh as the credential helper for all authenticated hosts. Problem solved.

Tensorflow and R set up on server

Fri, 04 Nov 2022 00:00:00 +0000

I am trying to set up TensorFlow and Keras on an Ubuntu server, and I want to interact with them through R. I came across some errors such as

Error: Valid installation of TensorFlow not found.

ModuleNotFoundError: No module named '_ctypes'

After googling, here is the code I used to solve this issue, following the instructions here.

sudo apt update

sudo apt install make build-essential libssl-dev zlib1g-dev \
libbz2-dev libreadline-dev libsqlite3-dev wget curl llvm \
libncursesw5-dev xz-utils tk-dev libxml2-dev libxmlsec1-dev \
libffi-dev liblzma-dev

# you may need to use apt-get

Then, in R:

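library(reticulate)
# install a standalone Python build and remember where it lives
path_to_python <- install_python(force = TRUE)
# create a dedicated virtual environment that uses this Python
virtualenv_create("r-reticulate", python = path_to_python)
install.packages("keras")
# install_keras() comes from the keras package, so call it with its namespace
keras::install_keras(envname = "r-reticulate")
# confirm that R can find the TensorFlow installation
tensorflow::tf_config()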

It seems to work now.

Library not found for `-lgfortran`

Tue, 01 Nov 2022 00:00:00 +0000

After updating to macOS 13.0 (Ventura), I somehow got the following error when compiling an R package with C++ code:

ld: warning: directory not found for option '-L/usr/local/gfortran/lib'
ld: library not found for -lgfortran

This is probably because the system can no longer find the Homebrew-installed gfortran after the upgrade. I was in a rush and did not have time to figure it out. Instead, I just went to this webpage, downloaded the latest gfortran package, and installed it manually. After installation, I was able to compile the package again. Problem solved for now.

Host R packages on r-universe

Fri, 30 Sep 2022 00:00:00 +0000

The problem

Here is the problem: I am developing an R package, rtrees, which depends on a data package, megatrees, with a size of around 100 MB. It is not possible to submit the data package to CRAN given its large size. In addition, CRAN does not allow packages with a Remotes field (i.e., your package cannot depend on a package on GitHub). Therefore, I cannot submit rtrees to CRAN.

Solution

After searching around, I came across the R-universe project by rOpenSci. R-universe allows us to build binaries for R packages and host them online; basically, we can have our own personal CRAN-like repo to host binaries for R packages without much trouble by following its instructions. Now my data package is on my r-universe, and in the DESCRIPTION file of rtrees I can replace Remotes with the following line:

Additional_repositories: 
    https://daijiang.r-universe.dev

I think this should allow me to submit rtrees to CRAN in the future. Since R-universe builds binaries (Mac and Windows) for the R packages we put there, it is now pretty fast to install large packages.
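
On the user side, installing from an r-universe only requires adding it to the list of repositories. The sketch below uses the repository URL given above; the CRAN mirror URL is my assumption, and the install call is left commented out so the snippet has no side effects beyond setting an option.

```r
# Repositories to search: the r-universe from this post first, then a CRAN mirror
repos <- c(
  daijiang = "https://daijiang.r-universe.dev",
  CRAN     = "https://cloud.r-project.org"   # assumed mirror; any CRAN mirror works
)
options(repos = repos)

# With the option set, a plain call can resolve the data package from r-universe:
# install.packages("megatrees")
```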

Shiny App

When I deployed the Shiny app of rtrees, shinyapps.io did not recognize r-universe and returned an error. To deploy it, I needed to reinstall the package from GitHub using remotes::install_github(). This is because, when deploying a Shiny app, R installs packages the same way you installed them locally. If I installed the package from r-universe, R will try to do the same when deploying the Shiny app; if I installed it from GitHub, R will also install it from GitHub when deploying.

Weird R issue caused by messed up BLAS/LAPACK libraries

Mon, 22 Aug 2022 00:00:00 +0000

Today, the server showed some really bizarre behavior: when I ran a simple linear regression in R multiple times, the results were totally different! How is this possible? There is no randomness in a linear regression; it is deterministic!!

I had no idea what was going on, so I posted it on Stack Overflow. With some help from others, I thought the issue might come from the BLAS/LAPACK libraries on the server.

Currently, I have the Intel MKL version on the computer.

BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/libmkl_rt.so

So, I changed it to the OpenBLAS library, as it has similar speed to MKL.

sudo update-alternatives --config libblas.so.3-x86_64-linux-gnu
sudo update-alternatives --config liblapack.so.3-x86_64-linux-gnu

After restarting R, the problem is solved!! What a weird one.
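
To check from within R which libraries are in use and whether the results are stable again, something like the following can help. It is only a sketch: the model and the number of repeats are arbitrary choices of mine, and it relies on extSoftVersion() and La_library() from base R (available since R 3.4).

```r
# Which BLAS/LAPACK shared libraries is this R session linked against?
blas_path   <- extSoftVersion()[["BLAS"]]   # may be "" on some platforms
lapack_path <- La_library()
cat("BLAS:  ", blas_path, "\nLAPACK:", lapack_path, "\n")

# Refit the same model several times; a healthy setup gives the same coefficients
fits <- replicate(5, coef(lm(mpg ~ wt + hp, data = mtcars)), simplify = FALSE)
all_same <- all(vapply(
  fits[-1],
  function(b) isTRUE(all.equal(b, fits[[1]])),
  logical(1)
))
all_same
```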

Useful Vim commands

Wed, 09 Mar 2022 00:00:00 +0000

Vim commands

Normal mode to insert mode

i: insert text just before the cursor.
I: insert text at the start of the line.
a: append text just after the cursor.
A: append text at the end of the line.
o: open a new line below.
O: open a new line above.
s: substitute the current character.
S: substitute the current line.
r: replace the current character.
R: replace continuous characters.

Move cursor around

0: move to the start of the line.
^: move to the first non-blank character of the line.
$: move to the end of the line.
ctrl-b: move back one screen.
ctrl-f: move forward one screen.
H: jump as high as possible, i.e. the first line of the window.
M: jump to the middle of the window.
L: jump to the lowest line of the window.
G: jump to the end of the file.
1G or gg: jump to the start of the file.
30G: jump to line 30.
w: move to the start of next word. 2w move two words.
e: move to the end of next word.
b: move backward one word.
(: move to previous sentence.
): move to next sentence.
{: move to previous paragraph.
}: move to next paragraph.
ctrl-o: jump backward to previous location.
ctrl-i: jump forward to next location.
ma: mark current position, then move to other places and use 'a (single quote then a) to come back to the start of the marked line, or use `a (backtick then a) to jump to the exact place. You can also use mb and 'b, i.e. any letter a-z or A-Z.
%: jump to corresponding item, e.g. from left brace to the right brace

Visual mode

v then ap: (a paragraph) choose the paragraph where the cursor is.
v then aw: choose the word where the cursor is.
v then a": choose the whole quoted word/sentence that the cursor is within.
v then ab: choose a block of text within parentheses (use a[ for brackets, aB for braces).
~: switch cases of letters, i.e. upper to lower, lower to upper.
V: visual mode with lines.

Edit text

d: delete and put text into clipboard.
dd: delete the current line.
dl or x: delete the current letter where the cursor is.
dw: delete a word where the cursor is.
d$: delete text after the cursor of the current line.
d0: delete text before the cursor of the current line.
dh, dj, dk, d2ap, d2w, d3l, d24h, d5j, etc. Combine numbers with motions.
y: yank or copy text
yy: yank the current line.
yap: yank the current paragraph
p: paste text after cursor position.
P: paste text before cursor position.
xp: cut then paste after cursor, so swap two characters.
dwwP: cut one word, move to next word, paste before that word, so swap two words
.: repeat last action. If you want to repeat a series of actions, use qa to start recording a macro, make the changes, then press q to stop recording; on another line, use @a to apply the same changes to that line. Or qb, qc, etc.
u: undo the last change.
ctrl-r: redo the undo.
:earlier 5m: back to five minutes ago, i.e. time machine..
:later 45s: forward in time…
:undo 5: jump back to the state right after change number 5 (use 5u to undo the last five changes).
:noh: no highlight after search.
:undolist: view the undo tree.

Search

/word: to move to the first occurrence of word.
n: go to next occurrence.
N: go to previous occurrence.
/\<word\>: search word exactly.
/\d*: search for zero or more digits.

Search and replace

:s/search/replace/g: search and replace in current line.
:%s/search/replace/g: search and replace in all lines.
:%s/search/replace/gc: ask for confirmation

Multiple files

Multiple sections

:set foldmethod=indent, then on an indented line, zc to close the fold (compress), zo to open the fold, or za to toggle between closed and open.

Multiple files

:edit file1, :e file2, then use :b 1 to go to file 1, :b 2 to file 2. b means buffer. :ls to show all open buffers.

Multiple windows

:new: to open a new window.
ctrl-w h/j/k/l or ctrl-w ctrl-w: to move among windows.
:sp: to split current window. or use ctrl-w s.
:vsp: to split vertical window, or use ctrl-w v.
ctrl-w r: to rotate positions of windows.
ctrl-w K: move current window to topmost position.
:resize +10 or :resize -10: change window size to display 10 more/fewer lines.
ctrl-w _: increase current window size as much as possible.
ctrl-w =: make all windows same size.

Multiple tabs

:tabnew: to open a new tab
gt: go to next tab.
gT: go to previous tab.
:tabmove: to reorder tabs, e.g. :tabmove 0 moves the current tab to the first position.

Others

:!: to run shell commands within vim.

Add Multiple Passport Photos on One Page using R

Wed, 10 Jan 2018 00:00:00 +0000

In this post, I document how to put multiple photos on one page to save paper.

First, after taking the photos, I edited them with GIMP: adjusted light and color, and cropped to the desired area. Then we need to scale the cropped photo to the specified size. To do this, in GIMP, I first selected image/scale image. This allows us to scale the photo to the required size; it also allows us to specify the resolution in different units (e.g. pixel/inch, pixel/mm). If you have photos for more than one kid, make sure that both photos have the same size and resolution. This will make later steps easier. It is also useful to check (and scale if necessary) the print size. After scaling the photo, export it as an external file. Since I have two kids, I got two photos (same size and resolution) in my folder after these steps.

Time to use R. Specifically, the magick package did all the heavy lifting.

First, read the photos into R.

library(magick)
pic1 = image_read("pic1.jpg")
pic2 = image_read("pic2.jpg")
image_info(pic1) # size in pixel
image_info(pic2) # both should have the same size

To put multiple photos together, we can use the magick::image_append() function. This function, however, does not have an argument to specify the space between photos. Thus we need to create a blank image as a separator.

sep = image_graph(width = 100, height = image_info(pic1)$height, 
                  bg = "white")
plot(1, type = "n", axes = F, xlab = "", ylab = "")
dev.off()

Great, now we are ready to put them together.

pic1s = image_append(c(pic1, sep, pic1, sep, pic1, sep, pic1))
pic2s = image_append(c(pic2, sep, pic2, sep, pic2, sep, pic2))
# stack both
both = image_append(c(pic1s, pic2s), stack = TRUE)

Here I put four photos for each of them. You can adjust the above code if you want different numbers.

Finally, save the image to the disk.

image_write(both, path = "both.jpg")

Check it out! We have multiple photos on one page now. One additional step (optional) is to open the new both.jpg in GIMP and set the canvas size. I set it to 6 by 4 inches and exported it.
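
Before setting the canvas size, it can be handy to check whether the combined image fits the paper at all. The helper below is a sketch of that arithmetic for the layout used above (separators between columns, rows stacked without a gap); the function name and the example numbers are hypothetical.

```r
# Size of a photo grid built from identical photos with blank separators between columns.
# photo_w, photo_h, and sep_w are in pixels; dpi is pixels per inch.
grid_size <- function(photo_w, photo_h, n_col, n_row, sep_w, dpi) {
  width_px  <- n_col * photo_w + (n_col - 1) * sep_w  # separators only between photos
  height_px <- n_row * photo_h                         # rows are stacked without a gap
  list(
    width_px  = width_px,
    height_px = height_px,
    width_in  = width_px / dpi,
    height_in = height_px / dpi
  )
}

# Hypothetical example: 400 x 500 px photos at 300 dpi, 4 per row, 2 rows, 100 px separators
size <- grid_size(400, 500, n_col = 4, n_row = 2, sep_w = 100, dpi = 300)

# Does the grid fit on a 6 x 4 inch canvas?
fits <- size$width_in <= 6 && size$height_in <= 4
```

If fits is FALSE, either reduce the separator width, put fewer photos per row, or rescale the photos before appending them.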

That’s it. Super simple but very useful.

Fetching phylogenies from Phylomatic with R

Fri, 25 Aug 2017 00:00:00 +0000

It is usually a good idea to control for species' evolutionary history if we want robust results. This is because species are not independent of each other, which violates the independence assumption of most statistical models. Fortunately, with the growing availability of genetic data and software, building phylogenies is getting easier and easier.

Phylomatic is an easy way to fetch phylogenies for species, especially plants, online. Thanks to packages developed by rOpenSci, we can now use Phylomatic within R. One big advantage of this is reproducibility: we can regenerate the phylogeny whenever we want without clicking buttons on the website. In addition, because most ecologists use R for downstream analyses, fetching phylogenies within R makes the workflow much more natural and easier to follow.

The basic procedure for fetching phylogenies with Phylomatic using R will be:

1. Compile the species names we want to include in the phylogeny, and clean them if necessary (taxize package, rotl::tnrs_match_names())
2. Clean and prepare species names in the format to be used with Phylomatic (brranching::phylomatic_names())
3. Query Phylomatic and return the phylogeny (brranching::phylomatic(); if you have hundreds of species, it is better to use Phylomatic locally with brranching::phylomatic_local())¹

It is possible to merge steps 2 and 3, but I prefer to separate them.

I assume that you already have a list of species, named as sp_list. Then we can use the phylomatic() function from the brranching package. If you do not have it installed, install it first with install.packages("brranching").

sp_list = c() # placeholder: put your species names here
tree = brranching::phylomatic(sp_list)

If you have few species, this will likely give you a phylogeny with all species. However, in practice, it is quite possible that you will get a warning like this:

NOTE: 3 taxa not matched: NA/genus/species, ...

In this case, we may try to prepare species names first with brranching::phylomatic_names(). The default database is ncbi, but if you have hundreds of species, this can be slow. Instead, I would suggest using apg first because it is much faster (this is the default within brranching::phylomatic()). Then filter out the species that have NA as family and try ncbi or itis for them (these are the three databases supported). Sometimes your species names are not clean, e.g. they include synonyms; then the R package taxize will be really handy. In addition, I find rotl::tnrs_match_names() is also good for checking and resolving names. This function compares species names against the Open Tree of Life.

sp_list_phylocom = brranching::phylomatic_names(sp_list, 
                                                format = "isubmit", 
                                                db = "ncbi")

Now, let’s try to fetch the phylogeny again, with the updated species list.

tree = brranching::phylomatic(sp_list_phylocom)

As mentioned earlier, it is possible to merge these two steps into one with tree = brranching::phylomatic(sp_list, db = "ncbi") but I prefer to resolve species names first.

The default backbone phylogeny is the APG III R20120829. We can use the Zanne et al. 2014 phylogeny.

tree = brranching::phylomatic(sp_list_phylocom, 
                              storedtree = "zanne2014")
plot(tree)

Finally, I have one reproducible example on GitHub that shows how to use the brranching package to get a phylogeny for plants. Feel free to check it out (and the associated paper if you are interested)!

¹ Another option to use Phylomatic locally is to download Phylocom, which can also be used within R using the package phylocomr.

List of functions from tidyverse that I do not use often

Sat, 05 Aug 2017 00:00:00 +0000

I do not use these functions often, but they can be really useful for some tasks.

ggplot2 package:
- coord_cartesian(xlim = , ylim = ) to zoom in on a part of a figure, which is different from xlim() or scale_x_continuous(limits = ). The latter will simply toss data points outside the limits.
- cut_width(), cut_interval(), cut_number() to convert a continuous variable to groups.
- ggplot by default will drop categories without any value, to avoid this, use ... + geom_bar() + scale_x_discrete(drop = FALSE).
- reorder a factor according to a numerical variable: ggplot(data, aes(num_var, forcats::fct_reorder(factor_var, num_var))) + geom_point().
- remove legend: ... + guides(fill = FALSE) or ... + guides(color = FALSE)
- change legend rows: ... + guides(fill = guide_legend(nrow = 1))
- change legend title: ... + labs(fill = "title") or ... + labs(color = "title") or ... + scale_fill_xxx(name = "title")
- change axes tick labels: e.g. ... + scale_x_log10(labels = scales::dollar, breaks = ...) or ... + scale_x_discrete(labels = scales::wrap_format(10)). Package scales can be useful.
- draw maps: ... + geom_polygon(aes(group = group)) + coord_map(projection = "albers", lat0 = 39, lat1 = 45)
- when write a function for plotting, aes_string() can be useful.
- scale_x_continuous(expand = c(.1, .1)) to expand the plot to avoid cutoff of labels.
- scale_x_discrete(limits = rev(levels(grp))) to reverse the order of a factor.
- p + xlab(NULL) to remove x labels and its space.
tidyr package:
- complete() complete a data frame with missing combinations of data. Turns implicit missing values into explicit missing values.
- fill() Fills missing values in using the previous entry. Useful if repeated values are omitted. Last observation carried forward.
- convert = TRUE within gather() and spread() to convert the generated column into correct types.
- extract() with regular expressions to extract part of a column.
dplyr package
- transmute() will only keep generated variables.
- count() count the number of observations.
- left_join(x, y, by = c("a" = "b")) when key variable has different names in x and y.
- bind_rows(list) = plyr::ldply(list): stack a list into a data frame (not always work, e.g. bind_rows(list(1:2, 3:4)) does not work but ldply() works)
stringr package
- str_subset(words, "x$") = words[str_detect(words, "x$")]
- str_count() will count how many matches resulted from str_detect(). str_count("abababa", "aba") will return 2.
- When you use a pattern that’s a string, it’s automatically wrapped into a call to regex(). See more options for regex().
forcats package
- fct_reorder(), fct_reorder2()
- fct_infreq(), fct_rev(), fct_recode(), fct_collapse(), fct_lump()
purrr package
- map(input, fun), similar to lapply(); when input is a data frame, do something specified by fun to each column and return a list. If you want a vector returned, use map_dbl(), map_lgl(), etc.
- when input is a list, same as plyr::llply(); e.g. we can use split(mtcars, mtcars$cyl) to get a list from a data frame.
- split(mtcars, mtcars$cyl) %>% map(~lm(mpg ~ wt, data = .)) do a lm to each element of the list; ~ is a shortcut for anonymous function, e.g. split(mtcars, mtcars$cyl) %>% map( function(df) lm(mpg ~ wt, data = df))
- a list of models from the above point named as models, then models %>% map(summary) %>% map_dbl(~.$r.squared) will extract $R^2$ of each model. We can do this by strings too: models %>% map(summary) %>% map_dbl("r.squared"); can even use position sometimes, e.g. map_dbl(list(list(1, 2, 3), list(4, 5, 6), list(7, 8, 9)), 2).
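
For comparison, here is the same split / fit / extract pattern from the purrr notes written in base R only. This is just a sketch to show what map() and map_dbl() are doing; it swaps in lapply() and vapply(), and the object names are mine.

```r
# Split the data frame into a list of data frames, one per number of cylinders
by_cyl <- split(mtcars, mtcars$cyl)

# Counterpart of map(~lm(mpg ~ wt, data = .)): fit one model per list element
models <- lapply(by_cyl, function(df) lm(mpg ~ wt, data = df))

# Counterpart of map(summary) %>% map_dbl("r.squared"): one number per model
r2 <- vapply(models, function(m) summary(m)$r.squared, numeric(1))
r2
```

The names of r2 come from the list created by split(), so each R² stays labelled by its cylinder group.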

fopenmp option of clang error

Wed, 21 Jun 2017 00:00:00 +0000

When I tried to source an Rcpp file, I got the following error under macOS:

clang: error: unsupported option '-fopenmp'

After a bit of googling, I found this post, which in the end solved my problem. Briefly, I did the following steps:

Installed Xcode from the App Store (a simplified version may be enough) or in Terminal with xcode-select --install
Installed llvm via brew install llvm
Downloaded and installed the gfortran binary installer from here. Note: You will need to download the OS X El Capitan gfortran 6.1 binaries regardless of whether or not you are on macOS Sierra, which presently only offers gfortran 6.3.
Downloaded and extracted clang to /usr/local/clang4 (overwrite it if it already exists: sudo cp -r ~/Downloads/usr/local/clang4 /usr/local/clang4). See here for more information.
In terminal

   cat <<- EOF > ~/.R/Makevars
   # The following statements are required to use the clang4 binary
   CC=/usr/local/clang4/bin/clang
   CXX=/usr/local/clang4/bin/clang++
   LDFLAGS=-L/usr/local/clang4/lib
   # End clang4 inclusion statements
   EOF

Then, the error is fixed (for now)!

Blog posts with academic styles

Sat, 20 May 2017 00:00:00 +0000

Goal: I want to write academic style blog posts with: citations, cross-reference of tables and figures, and I want to manage figure path by myself. The default setting of blogdown can handle citations and cross-references pretty well thanks to Yihui’s awesome work on bookdown and blogdown packages, but the figures are nested too deep. I just want to put all figures under static/figures.

After a bit of digging, I managed to do this. The main trick is to add a knitr setup chunk to the Rmd file, and then parse it with blogdown::render_page(), based on Yihui’s setup. If a post does not have any figures, it will skip the first step and go directly to blogdown::render_page(). I did not look through all the functions available in the blogdown package, and I am sure there must be a better way to do this. Anyway, I get what I want for now.
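
A setup chunk along these lines redirects where knitr writes figures. It is only a sketch and not necessarily the exact chunk used for this blog: the relative path assumes the post lives two levels below the site root (e.g. content/post/), and the folder name figures/ is chosen to match the static/figures goal stated above.

```r
# Setup chunk placed near the top of the post's .Rmd (paths are assumptions)
knitr::opts_knit$set(
  # directory that figure files are written relative to
  base.dir = normalizePath("../../static/", mustWork = FALSE),
  # prefix used when the figures are referenced in the generated HTML
  base.url = "/"
)
# figures then land in static/figures/ and are linked as /figures/...
knitr::opts_chunk$set(fig.path = "figures/")
```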

Citations

For citations, put the bibtex file in the same folder as the post, and then add bibliography: ref.bib in the yaml. You can even define the citation styles via csl: url_of_csl_file in the yaml.¹ Thousands of csl files are available at Github CSL repository. Go and find one you like and paste the url in the yaml.

Testing paragraph: Invasion of non-native species, one of the most widespread and harmful consequences of global change, is causing worldwide ecosystem degradation and economic loss (Vilà et al., 2011; Simberloff et al., 2013).

Math equations

Here is an inline equation, $a^2 + b^2 = c^2$, and a display equation:

\[f(x)=\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}\]

R code chunk

summary(cars)

##      speed           dist    
##  Min.   : 4.0   Min.   :  2  
##  1st Qu.:12.0   1st Qu.: 26  
##  Median :15.0   Median : 36  
##  Mean   :15.4   Mean   : 43  
##  3rd Qu.:19.0   3rd Qu.: 56  
##  Max.   :25.0   Max.   :120

Including plots and cross-refer it back

You can also embed plots and cross-reference them with \@ref(fig:figure-label), for example Figure 1:

plot(pressure)

Figure 1: Here is figure caption.

Table and cross-reference

You can also print tables and cross-reference them with \@ref(tab:table-label), for example Table 1:

knitr::kable(head(pressure), caption = "Table legend.")

Table 1: Table legend.
temperature	pressure
0	0.0002
20	0.0012
40	0.0060
60	0.0300
80	0.0900
100	0.2700

That’s it.

Any suggestions to improve this workflow? Comment below or send me a pull request. Thanks.

References

Simberloff, D., Martin, J.-L., Genovesi, P., Maris, V., Wardle, D.A., Aronson, J., Courchamp, F., Galil, B., García-Berthou, E., Pascal, M. & others (2013) Impacts of biological invasions: What’s what and the way forward. Trends in Ecology & Evolution, 28, 58–66.

Vilà, M., Espinar, J.L., Hejda, M., Hulme, P.E., Jarošík, V., Maron, J.L., Pergl, J., Schaffner, U., Sun, Y. & Pyšek, P. (2011) Ecological impacts of invasive alien plants: A meta-analysis of their effects on species, communities and ecosystems. Ecology Letters, 14, 702–708.

¹ A local file in the same folder will work too.

Reading notes: phylogenetic comparative models

Sat, 13 May 2017 00:00:00 +0000

Brief notes for my own use from a short primer by Cornwell & Nakagawa, 2017, Current Biology. It is rather simple and basic, but can be a good intro/reminder about big pictures.

Phylogenetic comparative methods

To explain the evolution of Earth’s diversity, phylogenetic comparative methods (PCM) often combine phylogeny with traits of species. The building of phylogenies (phylogenetics) is different from PCMs though they are not independent. PCMs are used to address the questions:

how did the characteristics of organisms evolve through time?
what factors influenced speciation and extinction?

Because species are not independent of each other, traditional linear regressions are not applicable. Felsenstein (1985), one of the first papers on PCMs, used phylogenetic independent contrasts to avoid this problem. Basically, instead of using species as data points, we can use the evolutionary branching points (divergences) as replicates in the model.

Trait evolution

We want to study the speed (tempo) and the manner (mode, e.g. slow and gradual, fast, with big jumps) of trait evolution. Common models of trait evolution are the Brownian motion and Ornstein–Uhlenbeck models.
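
To get a feel for how these two models differ, here is a small simulation sketch. It follows a single lineage through time with a simple discrete-time approximation (not a simulation on a phylogeny), and all parameter values are arbitrary choices of mine rather than anything from the primer.

```r
set.seed(42)
n_steps <- 200   # number of time steps
dt      <- 1     # length of one time step
sigma   <- 0.1   # rate of the random component

# Brownian motion: each step adds independent Gaussian noise,
# so the trait wanders freely and its variance grows with time
bm <- cumsum(c(0, rnorm(n_steps, mean = 0, sd = sigma * sqrt(dt))))

# Ornstein-Uhlenbeck: the same noise plus a pull of strength alpha
# toward an optimum theta, so the trait stays near the optimum
alpha <- 0.05
theta <- 1
ou <- numeric(n_steps + 1)   # starts at 0, like the BM trait
for (i in seq_len(n_steps)) {
  ou[i + 1] <- ou[i] + alpha * (theta - ou[i]) * dt +
    rnorm(1, mean = 0, sd = sigma * sqrt(dt))
}

range(bm)
range(ou)
```

Plotting both vectors against 0:n_steps shows the contrast in mode: the BM trait drifts without bound, while the OU trait moves toward theta and then fluctuates around it.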

We also want to study evolutionary links among traits and between traits and environmental variables. Advanced methods include generalized linear mixed models and structural equation models that account for species evolutionary relationships.

Lineage diversification

Why are some lineages more speciose than others of similar age? Where and when on the phylogeny were there shifts in diversification rate? And why did those shifts occur?

PCMs in different disciplines

Other disciplines are also using PCMs, e.g. community ecology, linguistics, anthropology and paleobiology, by building phylogenies for e.g. languages.

Caveats and the future of PCMs

Tree uncertainty. Species can be misplaced in a phylogenetic tree, ancestral nodes can be wrongly inferred, or more subtly, but more commonly, branch lengths are incorrect.
Trait uncertainty. Traits are measured with error. And for most PCMs, we used representative values, but it is hard to define representative. For example, what is the representative value for human height?
Model uncertainty. When we investigate trait evolution, we assume a certain model of evolution — most often, the Brownian motion model. However, a trait can evolve quite differently from such a simple model and there may be heterogeneity in the tempo and mode among the branches of the tree.

Notes of Data-driven Ecological Synthesis

Tue, 09 May 2017 00:00:00 +0000

I went to the excellent data-driven ecological synthesis summer school at the Station de Biologie des Laurentides (SBL) of the Université de Montréal, organized and taught by Timothée Poisot and Dominique Gravel. The station is one of the best research stations I have ever been to: great view, nice staff, and excellent food! The teachers are very approachable and very knowledgeable. Classmates were very nice to each other and we had lots of fun together. For example:

Setted up a huge fire at the end of the day. @tpoi @willvieira90 @ernlarson @gwynmac pic.twitter.com/BaCnH5pUBp
— Daijiang Li (@_djli) May 5, 2017

and

Nice hiking after a day of open data/ project discussion. @tpoi pic.twitter.com/5ZELNTKtWA
— Daijiang Li (@_djli) May 4, 2017

Thanks all for a great week.

Here are my very brief notes from this one-week class.

2017/05/01

What is data? Observations of variables have value and unit.
- meta data: when, who, how, why, intel. property?
Data plan? (NSF funded: data one, data life cycle https://www.dataone.org/data-life-cycle) (talked about 50 mins)
1. collect
2. assure: quality control:
3. describe: meta-data?
4. preserve: backup, ask computer center of University; figshare, etc. Be careful with Dropbox if you have government data etc. long-term archive. Who can have access?
5. discover: identify the data you need, which are not necessarily collected by yourself.
6. integrate: put data from different temporal/spatial scales together
7. analysis: overview of the data analyses to conduct.
exercise: 2-3 people/group, read a paper selected by themselves, discuss 2-3 steps of the data life cycle, how they did that? weakness? good? 20-30 mins.
Be serious about data archiving/integration when applying for funding or when writing and reviewing grants.
Ten Simple Rules for Creating a Good Data Management Plan
Ten Simple Rules for Digital Data Storage
Spreadsheet: flat files
- type SEP3, sept03, or sep03, and Excel turns it into 3-Sep or 9/3/2017. Even if you save as a csv file at the end, they are all 3-Sep, not what you typed in.
- tidy data: every column as variable, every row as an observation.
- NO THINGS: no merging cells, no color, no blank cells (be explicit about missing data and other possible issues that will result in missing data), no single information (no multiple tables)
- dates: YYYY-MM-DD-HH-MM-SS-TZ or split into date, time, and time zone.
- Location coordinates: be explicit about the format.
Template: use template to input data at the beginning of projects; when explain the variables, be explicit about possible values or rules to record. For example, how to name a site; for species, use Latin names; format of dates; etc.
Exercise: everyone creates a template for their own projects. 30 mins.
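
The advice on dates can be sketched in R as follows: store the date, time, and time zone explicitly in ISO-style text columns, and read them back with an explicit format so that nothing is guessed. The observation time and column names are made up for illustration.

```r
# One observation time, stored unambiguously (value is made up)
obs_time <- as.POSIXct("2017-09-03 14:30:00", tz = "UTC")

# Split into explicit date, time, and time zone columns
record <- data.frame(
  date = format(obs_time, "%Y-%m-%d"),   # YYYY-MM-DD, not "3-Sep"
  time = format(obs_time, "%H:%M:%S"),
  tz   = "UTC",
  stringsAsFactors = FALSE
)

# Reading back with an explicit format avoids guessing the day/month order
parsed <- as.Date(record$date, format = "%Y-%m-%d")
```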

2017/05/02

OpenRefine (morning)
- explore different datasets: facets, transforming cells, filtering rows, exploring scatter plots, e.g. [value, cells["mo"].value, cells["dy"].value].join("-")
- input datasets by multiple urls.
  - json files, select “rows” instead of “records” to make life easier.
Jupyter notebook + R (afternoon)
- a little bit of data manipulation.
- parallel with plyr: library(doMC); registerDoMC(detectCores() - 1); ddply(..., .parallel = TRUE)
- Book recommendation: The Pragmatic Programmer: From Journeyman to Master

2017/05/03

Morning
- Group discussion about mandatory data sharing/open (for and against, 2 groups, morning 45 mins)
  - debate.
  - For: drive to a better science system (system > individuals)
  - Against: unfair (synthesis vs data collectors;)
- Data sets and API (request/url and responses/json object); rOpenSci project/packages.
Afternoon
- Discussion about possible projects till 3pm
- Dom gave a talk about what public data can do. (Beyond the checklist: the biogeography of ecological interaction networks)
  - biogeography: spatial and temporal distribution of species and abundance, including causes and consequences.
  - the dominant conceptual tool in biogeography: the niche.
  - Is resource availability constant across gradients?
  - predation pressure constant across gradients?
  - how do interaction strength and population abundance covary?
  - what about highly diverse communities?
  - A community is more than a checklist
  - how do we move from a regional meta web to a local web?
  - revise biogeography by including species interactions
  - Gravel et al 2011 Ecol. Lett.
  - OBIS: marine occurrence data set.
  - fishbase: fish characteristics.
  - connectance is very high in global marine fish networks
  - how do you control for data quality? With huge datasets, the impact of errors may not be too problematic. More importantly, with a complex pipeline of scripts, be careful about possible programming errors: use defensive programming.
  - be careful about the sensitivity of data analyses to data quality.
- talking about designing databases.
  - be defensive when designing: for example, restrict the types of possible inputs (characters, small integers, etc., for error control); api design (JavaScript); advantages of an api: security, portability, remote working.
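
To make the "defensive programming" point concrete, here is a minimal base R sketch. The function name `validate_records` and the columns `species` and `count` are hypothetical, chosen only for illustration; the idea is simply to check types and ranges before any analysis runs.

```r
# Hypothetical validator for a table of occurrence records
validate_records <- function(df) {
  stopifnot(is.data.frame(df))
  stopifnot(all(c("species", "count") %in% names(df)))
  stopifnot(is.character(df$species), !anyNA(df$species))
  stopifnot(is.numeric(df$count), all(df$count >= 0))
  invisible(df)
}

good <- data.frame(species = c("Gadus morhua", "Salmo salar"),
                   count = c(3, 0), stringsAsFactors = FALSE)
bad <- data.frame(species = c("Gadus morhua", "Salmo salar"),
                  count = c(3, -1), stringsAsFactors = FALSE)

validate_records(good)                      # passes silently
outcome <- tryCatch(validate_records(bad),  # fails early on the negative count
                    error = function(e) "rejected")
outcome
```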

2017/05/04

Morning
- Dom. Gravel suggested books
  - An Illustrated Guide to Theoretical Ecology by Ted J. Case
  - The Theoretical Biologist’s Toolbox: Quantitative Methods for Ecology and Evolutionary Biology by Marc Mangel
- Relational databases
  - advantages: efficiency, security, less redundancy, faster queries, multiple users can work on the dataset at the same time
  - SQL: Structured Query Language
```
SELECT sphote AS host, sppar AS parasite, COUNT(sppar) AS number, AVG(a) AS a
FROM morphometry
WHERE sphote = 'Disa'
GROUP BY sphote, sppar
HAVING COUNT(sppar) > 3
ORDER BY number DESC
LIMIT 4;
```
  - SQL ecology data carpentry
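
For comparison, the clauses of the query above map onto base R roughly as follows. This is only a sketch: the toy `morphometry` data frame is made up, and only the column names (`sphote`, `sppar`, `a`) come from the query.

```r
# Hypothetical toy version of the morphometry table
morphometry <- data.frame(
  sphote = c(rep("Disa", 9), rep("Other", 2)),
  sppar  = c(rep("p1", 5), rep("p2", 4), rep("p1", 2)),
  a      = c(1, 2, 3, 4, 5, 2, 4, 6, 8, 1, 1),
  stringsAsFactors = FALSE
)

disa <- morphometry[morphometry$sphote == "Disa", ]               # WHERE
grp  <- split(disa, list(disa$sphote, disa$sppar), drop = TRUE)   # GROUP BY
res  <- do.call(rbind, lapply(grp, function(d) {
  data.frame(host = d$sphote[1], parasite = d$sppar[1],
             number = nrow(d), a = mean(d$a))                     # COUNT, AVG
}))
res <- res[res$number > 3, ]                                      # HAVING
res <- res[order(res$number, decreasing = TRUE), ]                # ORDER BY ... DESC
res <- head(res, 4)                                               # LIMIT
res
```
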
Afternoon
- Brief about projects to work on. (4 projects, and I work on my own project)
- Work on projects

2017/05/05

Morning
- Git/Github: common words e.g. repository, stage, commit, branch, merge
- License: choose a license
- github commit emoji 📚 comments here; ✨ comments list of emoji 🐛
- Illustrate collaboration via github
- Optimization coding
  - R high performance tutorial
  - Hadley’s optimising code chapter: went through the R code a little bit.
Afternoon
- work on project.

2017/05/06

Work on project the whole day.

2017/05/07

Morning
- Work on project; started group presentations at 10:30am, till 12pm.
Afternoon
- Back to Montreal at 3:30pm.

R packages installation issues

Sat, 29 Apr 2017 00:00:00 +0000

Some R packages that require installation from source are hard to install. Here, I just record some of the problems and solutions I have come across when installing R packages on macOS.

`rgdal` package

It is kind of annoying to install this package, but I found this answer helpful.

Basically, in the terminal, install GDAL first, which will take a while:

brew install --with-postgresql gdal

Then in R:

install.packages('rgdal', type = "source", configure.args=c('--with-proj-include=/usr/local/include','--with-proj-lib=/usr/local/lib'))

`sf` package

According to its github readme file, we may be able to install a binary package for sf. But this was not the case for me today, probably because R 3.4.0 was just released and a binary version was not available on CRAN yet. So, I still needed to install from source. According to its readme file, we need to do this in the terminal first (takes a while, ~10 minutes):

brew unlink gdal
brew tap osgeo/osgeo4mac && brew tap --repair
brew install proj 
brew install geos 
brew install udunits
brew install gdal2 --with-armadillo --with-complete --with-libkml --with-unsupported
brew link --force gdal2

Then we can go to R and install it normally.

to be updated

Clipping shape files in R

Wed, 12 Apr 2017 00:00:00 +0000

Suppose we have two shapefiles: one larger (e.g. a shapefile of the ecoregions of North America) and one smaller (e.g. a shapefile of the lower 48 US states). How can we get the shapefile of ecoregions for only the lower 48 states?

After a little bit of searching ¹, I came up with the following R function:

library(rgeos)
library(sp)
clip_shp = function(small_shp, large_shp){
   # make sure both have the same proj
  large_shp = spTransform(large_shp, CRSobj = CRS(proj4string(small_shp)))
  cat("About to get the intersections, will take a while...", "\n")
  clipped_shp = rgeos::gIntersection(small_shp, large_shp, byid = T, drop_lower_td = T)
  cat("Intersection done", "\n")
  x = as.character(row.names(clipped_shp))
  # these are the data to keep, can be duplicated
  keep = gsub(pattern = "^[0-9]{1,2} (.*)$", replacement = "\\1", x)
  large_shp_data = as.data.frame(large_shp@data[keep,])
  row.names(clipped_shp) = row.names(large_shp_data)
  clipped_shp = spChFIDs(clipped_shp, row.names(large_shp_data))
  # combine and make SpatialPolygonsDataFrame back
  clipped_shp = SpatialPolygonsDataFrame(clipped_shp, large_shp_data)
  clipped_shp
}

The clip_shp() function returns a SpatialPolygonsDataFrame of the intersections between the two input shapefiles ².

Another problem is that shapefiles of this kind are too large to plot. ggplot() may run forever with the data frame fortified from the shapefile. One solution is to first convert the shapefile into a data frame, then thin the data frame. Simply using dplyr::sample_frac() won’t work though. Here is a function I wrote (though kind of slow):
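
The least obvious step in clip_shp() is the gsub() call that recovers which rows of large_shp@data to keep. A minimal sketch, assuming the row names of the intersection have the form "<ID in small_shp> <ID in large_shp>" (the form the regular expression in the function expects); the example row names below are hypothetical:

```r
# Hypothetical row names of the kind the function expects:
# "<ID in small_shp> <ID in large_shp>"
x <- c("0 12", "0 13", "10 12")

# Drop the leading small_shp ID (1 or 2 digits) and keep the large_shp ID
keep <- gsub(pattern = "^[0-9]{1,2} (.*)$", replacement = "\\1", x)
keep
```

Note that `[0-9]{1,2}` only matches small_shp IDs with one or two digits, so the pattern would need to be widened for a small shapefile with 100 or more features.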

# Douglas-Peucker thinning for one polygon group;
# the larger the tol is, the fewer rows the result will have
thin = function(x, tol = 0.01){
  id = unique(x$id)[1]
  x1 = x[, 1:2] # the first two columns are long and lat
  names(x1) = c("x", "y")
  x2 = shapefiles::dp(x1, tol)
  data.frame(long = x2$x, lat = x2$y, id = id)
}

library(ggplot2)
library(dplyr)
# convert shapefile to data frame
shp_df = fortify(shp, region = "NAME") # change the region accordingly
# for each group, thin it
shp_df_thin = select(shp_df, long, lat, id, group) %>%
  group_by(group) %>%
  do(thin(., tol = 0.02))

Then we can use the thinned data frame to plot quickly with ggplot().

ggplot(data = shp_df_thin) + 
  geom_polygon(aes(x = long, y = lat, group = group), 
               color = "black", fill = "white") +
  coord_map()

Post here in case it will be helpful (to someone else or future myself).

mainly this post: https://philmikejones.wordpress.com/2015/09/01/clipping-polygons-in-r/ ↩
Of course, you need to read them first into R. E.g. small_shp = rgdal::readOGR("path/to/file", layer = "file_name") ↩

Writing Academic Papers with Rmarkdown

Wed, 05 Apr 2017 00:00:00 +0000

TL;DR: Rmarkdown and bookdown are awesome; you should use them to write papers; and here is a minimal example.

I have been using LaTeX for most of the papers I have published so far (admittedly not that many), even though all of my co-authors use Microsoft Word. Why? There are several reasons.

When writing, we should focus only on the content and not worry about the typesetting, which we can take care of later. Word, on the other hand, shows you the final layout as you write. This makes it hard for people (me at least) to ignore the typesetting when writing.
It is hard to update the figures and pictures inserted into the manuscript in Word. You need to delete old ones and insert new ones whenever your figures are updated. Of course, you can say: do not insert figures until submission. But wouldn’t it be easier to revise the manuscript when figures are included in the main text? Using LaTeX, I can just put the paths of the figures there and not worry about replacing them in the main text.
Literate programming: LaTeX allows us to mix code with text in the same file, which increases reproducibility and decreases potential errors.
Cross-referencing is easy in LaTeX (just \label and \ref). With Word, it is painful to get the same thing.

However, LaTeX has its learning curve and quirks. And even though it is intended to let people focus on content, we usually spend lots of time fighting with things like floats. Not to mention the collaboration barrier between its users and Word users. When I finish a draft of my paper, I need to convert it to Word using pandoc so my advisor can edit it. Doing it this way, however, figures and tables are usually messed up, as are cross-references. Tables will be just LaTeX source code; cross-references will be replaced with their labels (e.g. see Table tab-labels instead of see Table 1). So every time, I need to write something like “please do not care about the typesetting” in the email to my advisor.

Recently, I found that the conversion from LaTeX and Rmarkdown to Word is reasonably good, thanks to bookdown by Yihui. I just finished my first manuscript written 100% in Rmarkdown. Both my advisor and I are quite happy with it. Therefore, in this post, I am going to briefly describe the process of writing academic papers with Rmarkdown.

Markdown and Rmarkdown

First, you need to know a little bit about the syntax. Don’t worry, I am sure you will get it in five minutes. If you use Rstudio, this can be found under “Help” menu.

Packages needed

For this post, I have installed the following R packages: bookdown, rmarkdown, tufte, and knitr. If you do not want to produce pdf files, then you are ready to go. If you need pdf files, then you need to install LaTeX. On Windows and Linux, TeX Live is good; on Mac, MacTeX is available. If you do not mind the time to download and install, I recommend installing the full version, which includes all LaTeX packages.

Writing with Rmarkdown

After installing all dependencies, we can open Rstudio (or any text editor) and start writing. I use Rstudio to start the file and work with R code chunks. For the rest, however, I use Sublime Text or Atom. This is mainly because Rstudio lacks a distraction-free mode.

Yaml head

Here is the Yaml head I am using now:

---
title: Your awesome title
author: "Author one and Author Two"
date: '`r format(Sys.time(), "%d %B, %Y")`'
output:
  bookdown::tufte_html2:
    number_sections: no
    toc: yes
  bookdown::word_document2: null
  bookdown::pdf_document2:
    includes:
      before_body: doc_prefix.tex
      in_header: preamble.tex
    keep_tex: yes
    latex_engine: xelatex
    number_sections: no
    toc: no
bibliography: path/to/ref.bib
fontsize: 12pt
link-citations: yes
csl: https://raw.githubusercontent.com/citation-style-language/styles/master/global-ecology-and-biogeography.csl
---

A few notes here:

I use bookdown::pdf_document2 and the other bookdown::..._document2 formats, which make cross-references and other features possible.
For pdf files, I included some tex files (preamble.tex, which loads some packages I use, e.g. lineno to add line numbers; and doc_prefix.tex, which left-aligns the text).
References are put in the bib file. You can use common reference management software to create a bib file. Or you can search through Google Scholar, click on cite under the paper, choose the BibTeX format, then copy and paste the information into a bib file.
link-citations: yes allows you to click on a citation and jump to the corresponding entry in the reference list.
csl files are journal style files and can be a url like here or a local file.

R chunks

In my project setup, I have the .Rproj file in the project folder, along with R, Doc, etc. folders. R scripts are located within the R folder while the Rmarkdown file for the manuscript is in the Doc folder. By default, when you knit the Rmarkdown file with Rstudio, it will treat the folder where the Rmarkdown file sits as the working directory even though the .Rproj file is in the folder one level up. This is a little bit annoying because paths will resolve differently when you click the knit button than when you run chunks interactively within Rstudio.

To let knitr know that we want to use the parent folder as the working directory, we can add this chunk at the beginning of the file:

```{r knitr_options, echo=FALSE}
library(knitr)
opts_knit$set(root.dir = normalizePath("../"))
```

Then, in a separate R code chunk, we can source an R script with source("R/script.r").

Citations

After putting the sources you want to cite in a .bib file, we can cite them in the main text. The idea is that each source has one unique key, and you cite it with that key. See the rmarkdown website for details and examples.

Here is a statement [@key1; @key2].
@key3 did something.
Some examples [e.g. @key4; @key5; but see @key6]

Cross-references

Tables

Insert tables with the knitr::kable function (:: indicates that the kable function is from the knitr package in R). Then cross-reference the table with see Table \@ref(tab:tableName), which will render as something like see Table 1. The number of the table depends on its order in the manuscript; therefore, whenever you reorder your tables, you do not need to change their numbers by hand. The R code chunk for the table looks like this:

```{r tableName, results='asis', echo=F}
knitr::kable(mtcars[1:5, 1:5], booktabs = T, caption = "Caption here.")
```

The awesome R package kableExtra can be used to customize tables. When rendering to a Word document, however, having called library(kableExtra) earlier may mess up the tables. I added the following code to the setup chunk so that the package won’t be loaded when knitting to a Word document.

if(knitr::is_latex_output() | knitr::is_html_output()){
  library(kableExtra)
} else {
  options(kableExtra.auto_format = FALSE) # for docx
}

Figures

Cross-referencing figures is very similar to cross-referencing tables. Basically, you use Figure \@ref(fig:figName) to refer to a figure. And you put the label (figName here) and caption in the R code chunk:

```{r figName, fig.width=7, fig.asp=1, fig.cap="Your caption here."}
plot(x, y)
```

See more examples in the github file.

For complex figure or table captions, we can use text references. But make sure that there is no space at the end of the text! Otherwise, the captions won’t be replaced by the text references in the Word document.

These pretty much cover most of the common features of scientific writing: citations, cross-references, tables, figures. You can check out the bookdown website for details. Even though bookdown is made for writing books, it is actually very good for writing papers too!

How do you set up Rmarkdown for writing? What tips do you have? Issues? Comments are very welcome. Or even better, you can click the pen on paper button at the top right of the post and edit it.

Thanks for reading and commenting!

Updating website with Hugo and Blogdown

Thu, 30 Mar 2017 00:00:00 +0000

My personal website has been full of weeds since I have not updated it for a really long time. As I am trying to put together a package to apply for jobs, I finally got some time to update my website. The previous version of my website was built with Jekyll. However, it is a bit slow, and whenever I want to create a new post, I need to type the yaml head. (Yes, you can set up a snippet, but…). Finally, Yihui wrote an awesome R package, blogdown, to create personal websites with Hugo and Rstudio, which makes it so much easier to update your website and to publish new blog posts. Here is a post to briefly record how I have done it. When something is not clear, the best way is to look at the source code here or here.

Install Hugo and create your site

First, go to the blogdown webpage and install it. For Mac users, you may need to install homebrew first, which you definitely should.

devtools::install_github('rstudio/blogdown')
blogdown::new_site()

Then go to the new website folder and make it a git repository with git init. You should also make a .gitignore file.

Yihui’s modified hugo-lithium-theme is simple and good, so I decided to use it. You can install it with

blogdown::install_theme('yihui/hugo-lithium-theme')

I did not install it this way, instead, I used

git submodule add git@github.com:yihui/hugo-lithium-theme.git themes/hugo-lithium-theme

This will clone the theme into your themes folder locally but won’t copy it when you push it into github (because of the .gitmodules file created).

Tweak your website

Now you can put your old posts and webpages into the content folder. If you are familiar with CSS and html, then it should be straightforward.

config.yaml is the first file to change and it is self-explaining;
logo picture should go into static/images/logo.png;
CNAME file also into static, then Hugo will copy it into public folder, which is the generated website at the end;
- anything within static folder will be copied to public as is: i.e. static/images/fig1.jpg will be copied as public/images/fig1.jpg.
tweak files in layout folder, e.g. partials/footer.html to change footers of your website;
if you want to close comments on some pages, put disable_comments: true in the yaml head.
because my previous posts have slightly different syntax, I need to update them one by one using some R code.
If you use Rstudio addins to create a new post, set options(blogdown.subdir = "content") first. Then, if you select the subdirectory en and the title title, Rstudio will create the file as content/en/date-title.md; otherwise, it will create it as content/post/en/date-title.md.
- Now, you can just open Rstudio and start to write your blogs!

Publish your website

Approach 1: Push website into Github

Now, you need to create two repositories at Github: one to host the hugo folder and one to host the generated website (i.e. the public folder). Suppose you have two repositories now: username.github.io (to host generated website) and website (to host hugo code). Within your website folder:

rm -rf public # do not worry
git remote add origin https://github.com/username/website.git
git submodule add -f https://github.com/username/username.github.io.git public
# push website source code to github
git commit -am "Initial commit"
git push -u origin master

Now, you can generate the website again, either use Rstudio or terminal.

hugo
cd public
git add .
git commit -m "Build website"
git push -u origin master

Approach 2: Use Netlify

The free plan of netlify can meet all my requirements: build my website from the source, https, and custom domain. So I have deployed my website there. The good thing is that I do not need to push the public folder to github anymore. Whenever I change the source code of my website, netlify will automatically rebuild my website for me! How cool is that?

Issues

Here are some issues I still have:

I used to have my Chinese and English blogs separated; and I have two short names for Disqus comments for them. Now I have merged these two blogs into one folder, but I cannot merge their comments too. I can only choose one shortname. Any solutions?
~~In this setup (submodule for public folder), whenever I rebuild the site, almost all webpages in the public folder changed and need to commit and push to github?? Why?~~
- It turns out that Hugo will rebuild webpages that have been changed (e.g. lists of blog posts) but not all of them. So, this is not an issue anymore.
to be updated.

Why no p-values in mixed models

Mon, 22 Jun 2015 00:00:00 +0000

For many traditional statistical modeling techniques such as linear models fitted by ordinary least squares (e.g. t-tests, ANOVA), we can derive exact distributions (e.g. the t-distribution) for some statistics calculated from the data under the null hypothesis, and then use these distributions to perform hypothesis tests on the parameters or calculate confidence intervals. It is tempting to believe that all statistical techniques should provide such packaged results (e.g. p-values), but they do not. For example, you may have noted that summaries for model objects fitted with lmer list standard errors and t-statistics for the fixed effects, but no p-values. This is not without reason.

Early mixed-effects model methods used many approximations based on analogy to fixed effects ANOVA. For example, variance components were often estimated by calculating certain mean squares and equating the observed mean square to the corresponding expected mean square. In this way, we cannot handle multiple factors such as subjects and items associated with random effects as well as unbalanced data. Fortunately, it is now possible to evaluate the maximum likelihood or the REML estimates of the parameters in mixed-effects models (this is the case for R package lme4) to move further (e.g. handle unbalanced data, nested design, crossed random effects, etc.). However, the temptation to perform hypothesis tests using t-distribution or F-distributions based on certain approximation of the degrees of freedom in these distributions persists.

An exact calculation may be possible for a comparatively simple model applied to an exactly balanced data set. In the real world, data are often unbalanced and models can be complicated. Under the null hypothesis, the test statistic may not even follow a t- or F-distribution (or its distribution may not be known at all). The formulas for the degrees of freedom for inferences based on t-/F-distributions do not apply in such cases (or are even meaningless). In lme4, the numerators of the F-statistics are calculated as in a linear model. The denominator is the penalized residual sum of squares divided by the REML degrees of freedom, which is n-p, where n is the number of observations and p is the column rank of the model matrix for the fixed effects (Douglas Bates). All the F ratios use the same denominator. There are many approximations in use for hypothesis tests in mixed models, each leading to a different p-value, but none of them is “correct”.
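
To make the n-p quantity concrete, here is a small base R sketch. It only computes the denominator degrees of freedom described above from a fixed-effects model matrix; it does not reproduce any lme4 internals. The ChickWeight data set ships with R, and the formula is a hypothetical fixed-effects specification chosen only for illustration.

```r
# ChickWeight is in the datasets package; the formula is a made-up example
X <- model.matrix(~ Time + Diet, data = ChickWeight)

n <- nrow(X)      # number of observations
p <- qr(X)$rank   # column rank of the fixed-effects model matrix
df_denom <- n - p # the "REML degrees of freedom" described above
c(n = n, p = p, df_denom = df_denom)
```

The same n-p is used regardless of how the random effects are structured, which is one way to see why it is not a trustworthy basis for p-values in unbalanced or crossed designs.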

Why ANOVA is not the choice for non-normal data

Fri, 19 Jun 2015 00:00:00 +0000

Reading notes of Stroup, Walter W., “Rethinking the Analysis of Non-Normal Data in Plant and Soil Science”, Agronomy Journal 107, 2 (2015), pp. 811.

Some history: Fisher and Mackenzie (1923) published the first ANOVA results. Nelder and Wedderburn (1972) introduced generalized linear models, a major departure in approaching non-normal data. Breslow and Clayton (1993) and Wolfinger and O’Connell (1993) integrated mixed models and generalized linear model theory and methods. The following two decades saw intense development of GLMM theory and methods.

ANOVA rests on three assumptions: independent observations (vs correlated observations), normally distributed data (vs non-normal data), and homogeneous variance (vs heterogeneous variance). However, non-normal data are common, e.g. counts (Poisson or negative binomial), time to flowering (exponential or gamma), continuous proportions such as leaf area affected (beta), quadrats observed out of n quadrats (binomial). For these non-normal distributions, the variance depends on the mean. Thus, if data are non-normal, chances are their variances are not homogeneous. Traditionally, analysts relied on the Central Limit Theorem, which assures that the sampling distribution of means will be approximately normal if the sample size is large enough, and on standard variance-stabilizing transformations to deal with heterogeneous variances, e.g. log(count + 1), sqrt(small_count + 3/8), count^(2/3), asin(sqrt(proportion)). GLMMs extended linear model theory to accommodate data that may be non-normal, have heterogeneous variance, and be correlated. From the GLMM point of view, ANOVA is antiquated or even obsolete.
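
The mean-variance link and the effect of a variance-stabilizing transformation are easy to see by simulation. Here is a minimal base R sketch with made-up Poisson means (not data from Stroup's paper): the raw variance tracks the mean, while the variance of sqrt(count + 3/8) stays roughly constant.

```r
set.seed(1)
means <- c(2, 10, 50)   # hypothetical treatment means
sims <- lapply(means, function(m) rpois(10000, lambda = m))

raw_var  <- sapply(sims, var)                              # grows with the mean
sqrt_var <- sapply(sims, function(y) var(sqrt(y + 3/8)))   # roughly constant

data.frame(mean = means, raw_var = raw_var, sqrt_var = sqrt_var)
```

Stabilizing the variance does not make the transformed means easy to interpret, which is part of the argument for GLMMs in the paragraphs that follow.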

Stroup (2015) showed that ANOVA with untransformed and log-/sqrt-transformed count data and GLMM all control Type I error adequately, but GLMMs have more power to detect treatment differences; for discrete proportion data, untransformed ANOVA yields estimates of the marginal $p_i$ but not the correct standard errors, the GLMM yields estimates of the conditional $p_i$ and correct standard errors, the arc sine transformed ANOVA does not provide estimates of either.

Take a binomial example: the ith treatment in the jth block with $N_{ij}$ yes-no observations and probability $p_{ij}$ of a yes response on any given ijth observation unit. Three distributions are relevant to the analysis of these experimental data.

The distribution of block effects (random effects). Blocking is a design strategy to ensure that units within blocks are as similar as possible. Variability among blocks is expected, and we assume the blocks are representative of the blocks we could have used. Thus variation among blocks is assumed to follow a normal distribution: $b_j\sim NI(0,\sigma_{B}^{2})$ (normal and independent).
The distribution at the unit level: observations in the ijth unit ~ $Binomial(N_{ij},p_{ij})$. This distribution is conditional on the random effects. $y_{ij}|b_j\sim Binomial(N_{ij},p_{ij})$: the distribution of the observations, conditional on the observation being in the jth block, is binomial (with $N_{ij}$ and $p_{ij}$).
The actually observed distribution: the marginal distribution. When we say we have binomial data, we are referring to the distribution of the observations conditional on the ijth unit. The distribution of the observed data, the marginal distribution, is most likely not binomial.

We cannot observe the first two distributions directly. The only distribution we observe is the third one. This is not an issue if the first two are normal distributions, as the third will then also be normal. For non-normal data, the marginal distribution of the observed data is quite different. Our usual intuitions can betray and mislead us. The fundamental problem of analyzing non-normal data is that what we want to estimate or test (in this example, the treatment effects on $p_{ij}$ of binomial data) involves parameters of distributions that we cannot directly observe. In other words, the information we want is camouflaged in a complex observed marginal distribution. GLMMs can extract the information we want from the observations we have; ANOVA and regression cannot.

The GLMM conditional estimate asks: “if I take an average member of the population, which means a member of the population whose block effect $b_j=0$, what is the estimated binomial probability?” (think about the median value). The marginal estimate asks: “if I average across all the members of the population, what is the mean proportion?” (think about the mean value). Which one to use depends on your question.
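
The difference between the two estimates can be illustrated with a short simulation. This is only a sketch under stated assumptions: a logit link, and made-up values for the treatment effect `eta` and the block standard deviation `sigma_b`.

```r
set.seed(42)
eta     <- -2    # hypothetical treatment effect on the logit scale
sigma_b <- 1.5   # hypothetical standard deviation of block effects
b <- rnorm(1e5, mean = 0, sd = sigma_b)   # simulated block effects

p_conditional <- plogis(eta)              # probability for a block with b_j = 0
p_marginal    <- mean(plogis(eta + b))    # probability averaged over all blocks
p_median      <- median(plogis(eta + b))  # close to the conditional value

c(conditional = p_conditional, marginal = p_marginal, median = p_median)
```

Because the inverse logit is nonlinear, averaging over blocks pulls the marginal probability toward 0.5, while the median across blocks stays near the conditional value.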

Stroup (2015) argues that, for binomial data, ANOVA with or without transformation should be considered unacceptable for publication. If the marginal mean best addresses the research objectives, the correct approach requires an alternative formulation of the GLMM, namely generalized estimating equations (GEEs, Zeger et al. 1988). GEE replaces random effects in the linear predictor with a working variance and correlation and replaces the distribution with a quasi-likelihood. Assuming equal N for all experimental units, the beta GLMM is the preferred method if the marginal mean is the appropriate target. For unequal N, use the GEE.

In sum, Stroup’s (2015) main take-home message: for non-normal data, ANOVA, with or without transformed data, won’t work. The loss of accuracy and power are too great. GLMMs and, in some cases, GEEs are the methods of choice.

Youtube view counts of Linear Algebra lectures

Mon, 01 Jun 2015 00:00:00 +0000

I am learning linear algebra these days by watching the excellent series of lectures taught by Prof. Gilbert Strang on Youtube. Along the way, I thought it would be interesting to look at the view counts of all lectures. I expect the view counts to decline for later lectures.

Alright, first load some R packages in order to get data from Youtube.

library(plyr)
library(dplyr)
library(rvest) # for webpage scraping
library(stringr) # string handling
library(ggplot2) # plotting
library(knitr)

Then I searched online to find out the url of the playlist for all lectures. To find the correct CSS part, I followed this tutorial.

# the playlist first
# read_html() replaces html(), which is deprecated in newer versions of rvest
url = read_html("https://www.youtube.com/playlist?list=PLE7DDD91010BC51F8")
lectures = html_nodes(url, ".yt-uix-tile-link")
# length(lectures) # 35 videos
# get lecture names
lec_names = html_text(lectures) %>% 
  sapply(function(x) str_replace(x, "^.*Lec ([b0-9]*) .*", "\\1")) %>% 
  unname() %>% as.character()
lec_names[lec_names == "24b"] = 24.5
lec_names = as.numeric(lec_names)
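
To see what the regular expression above does, here is a small base R sketch using sub() in place of stringr::str_replace(). The two titles are hypothetical strings shaped like the playlist link texts, not scraped values.

```r
# Hypothetical link texts, shaped like the playlist titles
titles <- c("Lec 1 | MIT 18.06 Linear Algebra, Spring 2005",
            "Lec 24b | MIT 18.06 Linear Algebra, Spring 2005")

# Keep only the lecture label that follows "Lec "
lec <- sub("^.*Lec ([b0-9]*) .*", "\\1", titles)

# Place the extra lecture "24b" between lectures 24 and 25 on the x axis
lec[lec == "24b"] <- "24.5"
lec_num <- as.numeric(lec)
lec_num
```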

Then, get urls for all lectures and extract their view counts.

# get url for all lectures
url_all = ldply(lectures, function(x){
  paste0("https://www.youtube.com", html_attr(x, name = "href"))
})

# for each lecture, get the view count
view_all = sapply(url_all$V1, function(x){
  print(x)
  xx = read_html(x)
  view_count = html_nodes(xx, ".watch-view-count") %>% html_text() %>%
    gsub(",", "", .) %>% 
    as.numeric()
  lect_descrip = html_nodes(xx, "#eow-description") %>% html_text() %>% 
    gsub("^(.*)View the complete.*$", "\\1", .) %>% str_trim()
  print(lect_descrip)
  list(view_count, as.character(lect_descrip))
})

Now, combine lecture names with their view counts.

# combine lecture names with view count
view = unlist(view_all[(1:length(view_all)) %% 2 == 1])
# remove some notes that start with *.
descrip = unlist(view_all[(1:length(view_all)) %% 2 == 0])
descrip = sapply(descrip, function(x){
 if(str_detect(x, "\\*")){
   str_replace(x, "^(.*)\\*+.*$", "\\1")
 } else{
   x
 }
})
dat = data_frame(lec = lec_names, view = view, description = descrip)
kable(data.frame(lec = lec_names, view = format(view, big.mark = ","), description = descrip), format = "html")

view	description
1,471,018	Lecture 1: The Geometry of Linear Equations.
421,456	Lecture 2: Elimination with Matrices.
359,628	Lecture 3: Multiplication and Inverse Matrices.
304,766	Lecture 4: Factorization into A = LU
210,564	Lecture 5: Transposes, Permutations, Spaces R^n.
199,004	Lecture 6: Column Space and Nullspace.
151,769	Lecture 7: Solving Ax = 0: Pivot Variables, Special Solutions.
138,087	Lecture 8: Solving Ax = b: Row Reduced Form R.
151,718	Lecture 9: Independence, Basis, and Dimension.
132,971	Lecture 10: The Four Fundamental Subspaces.
104,452	Lecture 11: Matrix Spaces; Rank 1; Small World Graphs.
82,273	Lecture 12: Graphs, Networks, Incidence Matrices.
77,250	Lecture 13: Quiz 1 Review.
108,408	Lecture 14: Orthogonal Vectors and Subspaces.
99,687	Lecture 15: Projections onto Subspaces.
96,329	Lecture 16: Projection Matrices and Least Squares.
95,593	Lecture 17: Orthogonal Matrices and Gram-Schmidt.
90,094	Lecture 18: Properties of Determinants.
79,339	Lecture 19: Determinant Formulas and Cofactors.
85,189	Lecture 20: Cramer's Rule, Inverse Matrix, and Volume.
159,954	Lecture 21: Eigenvalues and Eigenvectors.
109,883	Lecture 22: Diagonalization and Powers of A.
84,893	Lecture 23: Differential Equations and exp(At).
84,173	Lecture 24: Markov Matrices; Fourier Series.*
36,172	Lecture 24b : Quiz 2 Review.*
59,755	Lecture 25: Symmetric Matrices and Positive Definiteness.*
62,566	Lecture 26: Complex Matrices; Fast Fourier Transform.
58,041	Lecture 27: Positive Definite Matrices and Minima.
70,082	Lecture 28: Similar Matrices and Jordan Form.
85,714	Lecture 29: Singular Value Decomposition.
99,162	Lecture 30: Linear Transformations and Their Matrices.
61,037	Lecture 31: Change of Basis; Image Compression.
36,158	Lecture 32: Quiz 3 Review.
55,596	Lecture 33: Left and Right Inverses; Pseudoinverse.
50,540	Lecture 34: Final Course Review.

Finally, let’s plot it.

# plot
ggplot(dat, aes(x = lec, y = view)) +
  geom_point(color = "red", size = 2) + 
  geom_line(color = "blue") +
  labs(x = "Lectures", y = "YouTube view count",
       title = "YouTube view counts of Linear Algebra lectures taught by\nGilbert Strang, Spring 2005")

Wow, the first lecture has by far the most views: 1,471,030 (as of 2015-06-21 23:00 Central Time)! However, the second lecture has about one million fewer views than the first. It would be interesting to find out why lectures 21 and 22 have more views than their neighbors (I am getting there; I am at lecture 14 now. Eigenvalues!). The last lecture has about 50K views. Does this mean that about 50K people finished all the lectures?

It clearly shows how hard it is to be persistent.

Some useful keyboard shortcuts for Atom editor

Fri, 10 Apr 2015 00:00:00 +0000

I am trying to switch to GitHub’s new editor, Atom. Here are some notes on things I have found useful.

Packages

To see all installed packages, run apm list in your terminal. I have used the following packages so far:

atom-material-syntax # great syntax highlighting
atom-material-ui # great user interface
autocomplete-bibtex # autocomplete citations
autocomplete-paths # autocomplete path of files
file-icons # show file icons in the tree view
git-time-machine # compare git files
ink # for julia language
julia-client #
language-julia #
language-latex #
language-markdown #
markdown-preview-plus # render math equations
markdown-writer # make writing in markdown easier
minimap # show minimap of your file
minimap-find-and-replace # show found items in minimap
pen-paper-coffee-syntax #
project-manager #
terminal-panel # run terminal within Atom
typewriter #
vim-mode # I like the vim mode of moving cursor
wordcount #
Zen # distraction free

To install all of them: apm install atom-material-syntax atom-material-ui autocomplete-bibtex autocomplete-paths file-icons git-time-machine ink julia-client language-julia language-latex language-markdown language-r markdown-preview-plus markdown-writer minimap minimap-find-and-replace pen-paper-coffee-syntax project-manager terminal-panel typewriter vim-mode wordcount Zen

Shortcuts

Multi-cursor

I also like the multi-cursor feature from Sublime Text, which is a must for me. Shortcuts within Atom:

ctrl-D if you select a word and then hit ctrl-D, Atom will select the next occurrence of the same word for you. Then you can either type directly (which replaces the old word) or use the left or right arrow to append things.
ctrl-leftclick you can use this to place additional cursors wherever you want.
shift-alt-down or shift-alt-up to put cursors on multiple lines. Or you can select multiple lines first, then choose Selection -- Split into Lines (on Mac, you can use cmd-shift-L; sadly, there is no similar shortcut for Windows and Linux so far [in Sublime, we can use ctrl-shift-L]).

These cover most uses of multi-cursor, but I still miss the shift-rightclick_and_drag feature from Sublime Text.

Spell check

To enable spell check for LaTeX files, go to Settings, find the spell-check package, and add text.tex.latex to the Grammars field.

Commonly used

shift + f11: full screen, distraction free, from the Zen package.
ctrl + \: toggle tree view.
ctrl + /: toggle comment.
ctrl + shift + up/down: move line up/down.

update soon

Notes for Zoo 540 Theoretical ecology (Part I)

Fri, 12 Dec 2014 00:00:00 +0000

Simulation is critical for understanding what your methods are doing! Try to simulate your dataset before doing any statistical analysis.

Grouse data

The data have presence/absence of four bird species at 117 routes. Each route has 8 stations distributed evenly along a 1 mile by 1 mile border. The data also include environmental data at each station, including wind speed, temperature, noise, etc. The question is “what factors control species abundance and distribution?”.

Simulation

It is always a good idea to simulate your dataset before doing any statistical analysis. Here, we choose the species WITU (wild turkey) as an example (code from Tony Ives). The key point here is how to simulate from a compound distribution.

d  # the dataset in long table form: each row is an observation
w  # aggregated at each route, using `FUN = mean`.
# I decided I wanted to generate data that had the appropriate
# variability in counts per ROUTE. This variability can be seen in the
# following histogram.
hist(w$WITU)

# As a first attempt, assume that each observation at each STATION is
# random and independent of all other stations, including those
# stations in the same ROUTE. The mean number of observation
# (presences) across all STATIONs is
mWITU <- mean(d$WITU)

# Therefore, I produced a data set that has the same structure as d in
# which WITU is selected from a binomial distribution with probability
# = mWITU and size = 1 (size is the number of trials).

sim.d <- subset(d, select = ROUTE:Y_NAD83)
sim.d$WITU <- rbinom(n = dim(d)[1], size = 1, prob = mWITU)

# Now I treat sim.d just like d to get the histogram I'm interested in
sim.w <- data.frame(aggregate(cbind(sim.d$WITU, sim.d$X_NAD83, sim.d$Y_NAD83), 
    by = list(sim.d$ROUTE), FUN = "mean"))
names(sim.w) <- c("ROUTE", "WITU", "X_NAD83", "Y_NAD83")

# Finally, I compare the distributions. Run this code (starting with
# the subset() function above) several times to convince yourself these
# distributions are different.
op = par(mfrow = c(2, 1))
hist(w$WITU)
hist(sim.w$WITU)

## Because there is more variation in the data than in the first
## simulation, I decided to assume that ROUTEs had different
## probabilities of WITU being observed in STATIONs. Specifically, for
## each ROUTE, I assumed that the probability of a WITU being observed
## at a station was prob, and that prob is distributed according to an
## exponential distribution among ROUTEs.  This is an example of a
## compound distribution: the probability from a binomial distribution is
## itself described by an exponential distribution.

sim.d <- subset(d, select = ROUTE:Y_NAD83)

# This uses a for() loop that loops through the levels of sim.d$ROUTE.
for (route in levels(sim.d$ROUTE)) {
    n <- sum(sim.d$ROUTE == route)
    # cap at 1: an exponential draw can exceed 1, which rbinom() rejects
    prob <- min(rexp(n = 1, rate = 1/mWITU), 1)
    sim.d$WITU[sim.d$ROUTE == route] <- rbinom(n = n, size = 1, prob = prob)
    sim.d$route.mean[sim.d$ROUTE == route] <- prob
}

# Or ROUTE to be beta distribution first --> beta-binomial distribution
shape1 <- 1
shape2 <- (1 - mRUGR) * shape1/mRUGR

sim.d <- subset(d, select = ROUTE:DATE)
for (route in levels(sim.d$ROUTE)) {
    n <- sum(sim.d$ROUTE == route)
    prob.route <- rbeta(n = 1, shape1 = shape1, shape2 = shape2)
    sim.d$RUGR[sim.d$ROUTE == route] <- rbinom(n = n, size = 1, prob = prob.route)
}

# Again, I generate sim.w like w, although I've also added a column for
# the value of prob from each ROUTE and called it route.mean.
sim.w <- data.frame(aggregate(cbind(sim.d$WITU, sim.d$X_NAD83, sim.d$Y_NAD83, 
    sim.d$route.mean), by = list(sim.d$ROUTE), FUN = "mean"))
names(sim.w) <- c("ROUTE", "WITU", "X_NAD83", "Y_NAD83", "route.mean")

# Run this a few times to convince yourself that the simulations do a
# pretty good job reproducing the data
op = par(mfrow = c(3, 1))
hist(w$WITU)
hist(sim.w$WITU)
hist(sim.w$route.mean)

# for the beta-binomial distribution, we can also estimate prob by maximum
# likelihood first, then simulate the data

# Probability distribution function for a betabinomial distribution
# modified from the library 'emdbook'
dbetabinom <- function(y, prob, size, theta, shape1, shape2, log = FALSE) {
    if (missing(prob) && !missing(shape1) && !missing(shape2)) {
        prob <- shape1/(shape1 + shape2)
        theta <- shape1 + shape2
    }
    v <- lfactorial(size) - lfactorial(y) - lfactorial(size - y) - lbeta(theta * 
        (1 - prob), theta * prob) + lbeta(size - y + theta * (1 - prob), 
        y + theta * prob)
    nonint <- (y%%1) != 0  # flag non-integer counts (n was undefined here)
    if (any(nonint)) {
        warning("non-integer y detected; returning zero probability")
        v[nonint] <- -Inf
    }
    if (log) 
        v else exp(v)
}

# Log-likelihood function for the betabinomial given data Y (vector of
# successes) and Size (vector of number of trials) in terms of
# parameters prob and theta
dbetabinom_LLF <- function(parameters, Y, Size) {
    prob <- parameters[1]
    theta <- parameters[2]
    -sum(dbetabinom(y = Y, size = Size, prob = prob, theta = theta, log = TRUE))
}

LLestimates <- optim(fn = dbetabinom_LLF, par = c(prob = 0.2, theta = 0.5), 
    Y = w$OBS, Size = w$STATIONS, method = "BFGS")
# w$OBS : how many birds observed in all stations from one route
# w$STATIONS: 8 stations / route.
LLestimates
# $par prob 0.177216

# update parameters
shape1 <- LLestimates$par[1] * LLestimates$par[2]
shape2 <- (1 - LLestimates$par[1]) * LLestimates$par[2]

# This uses a for() loop that loops through the levels of sim.d$ROUTE.
sim.d <- subset(d, select = ROUTE:DATE)
for (route in levels(sim.d$ROUTE)) {
    n <- sum(sim.d$ROUTE == route)
    prob.route <- rbeta(n = 1, shape1 = shape1, shape2 = shape2)
    sim.d$RUGR[sim.d$ROUTE == route] <- rbinom(n = n, size = 1, prob = prob.route)
}

## Or we can even include environmental variables.

# Log-likelihood function for the betabinomial given data Y (vector of
# successes), Size (vector of number of trials), and independent
# variable X (WINDSPEEDSQR) in terms of parameters prob and theta
dbetabinom_LLF <- function(parameters, Y, Size, X) {
    theta <- parameters[1]
    b0 <- parameters[2]
    b1 <- parameters[3]
    
    # inverse logit function
    prob <- 1/(1 + exp(-b1 * (X - b0)))
    -sum(dbetabinom(y = Y, size = Size, prob = prob, theta = theta, log = TRUE))
}

LLe <- optim(fn = dbetabinom_LLF, par = c(theta = 0.5, b0 = 1, b1 = -0.5), 
    Y = w$OBS, Size = w$STATIONS, X = w$WINDSPEEDSQR, method = "BFGS")
LLe
# $par theta b0 b1 5.6803877 -3.7789926 -0.3007611

## Simulating data for RUGR

# Set up parameters
theta <- LLe$par[1]
b0 <- LLe$par[2]
b1 <- LLe$par[3]

# This uses a for() loop that loops through the levels of sim.d$ROUTE.
sim.d <- subset(d, select = ROUTE:DATE)
for (route in levels(sim.d$ROUTE)) {
    p <- 1/(1 + exp(-b1 * (w$WINDSPEEDSQR[w$ROUTE == route] - b0)))
    shape1 <- p * theta
    shape2 <- (1 - p) * theta
    n <- sum(sim.d$ROUTE == route)
    prob.route <- rbeta(n = 1, shape1 = shape1, shape2 = shape2)
    sim.d$RUGR[sim.d$ROUTE == route] <- rbinom(n = n, size = 1, prob = prob.route)
}

# Compute statistical significance of H0:b1=0 (effect of WINDSPEEDSQR)

# Log-likelihood function for the betabinomial given data Y (vector of
# successes), Size (vector of number of trials), and independent
# variable X (WINDSPEEDSQR) with b1 = 0 in terms of parameters prob and
# theta
dbetabinom_LLF <- function(parameters, Y, Size, X) {
    theta <- parameters[1]
    b0 <- parameters[2]
    
    # inverse logit function
    prob <- 1/(1 + exp(b0))
    -sum(dbetabinom(y = Y, size = Size, prob = prob, theta = theta, log = TRUE))
}

LLe0 <- optim(fn = dbetabinom_LLF, par = c(theta = 0.5, b0 = 1), Y = w$OBS, 
    Size = w$STATIONS, X = w$WINDSPEEDSQR, method = "BFGS")
LLe0

c(LLe$value, LLe0$value)
# [1] 179.9093 181.9467 (negative log-likelihoods)
pchisq(2 * (LLe0$value - LLe$value), df = 1, lower.tail = FALSE)
# 0.04352663
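
The grouse data are not included in this post, so here is a rough, self-contained sketch of the compound-distribution idea on made-up numbers. The number of routes, the mean probability `p_mean`, and `shape1 = 1` are assumptions chosen for illustration only; they are not values from the grouse data.

```r
set.seed(1)
n_routes <- 2000    # hypothetical; large so that the contrast is stable
n_stations <- 8     # stations per route, as in the grouse data
p_mean <- 0.2       # hypothetical mean probability of presence at a station

# (a) Independent stations: every route shares the same probability
obs_binom <- rbinom(n = n_routes, size = n_stations, prob = p_mean)

# (b) Compound distribution: each route first draws its own probability
# from a beta distribution whose mean shape1/(shape1 + shape2) equals p_mean
shape1 <- 1
shape2 <- (1 - p_mean) * shape1/p_mean
p_route <- rbeta(n = n_routes, shape1 = shape1, shape2 = shape2)
obs_betabinom <- rbinom(n = n_routes, size = n_stations, prob = p_route)

# Same mean, but the compound version is more variable among routes
c(mean_binomial = mean(obs_binom), mean_betabinomial = mean(obs_betabinom))
c(var_binomial = var(obs_binom), var_betabinomial = var(obs_betabinom))
```

The beta-binomial counts should have a clearly larger variance than the plain binomial counts. That extra between-route variation is what the histogram comparison in the code above is looking for.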

Maximum likelihood

# Likelihood function for a Bernoulli process; generate data
n <- 10
p <- 0.8
set.seed(123)
xi <- rbinom(n = n, size = 1, prob = p)

L <- function(pp) apply(X = array(pp), MARGIN = 1, FUN = function(ppp) prod(xi * 
    ppp + (1 - xi) * (1 - ppp)))
LL <- function(pp) apply(X = array(pp), MARGIN = 1, FUN = function(ppp) sum(log(xi * 
    ppp + (1 - xi) * (1 - ppp))))

par(mfrow = c(1, 1), lwd = 2, bty = "l", las = 1, cex = 1.5)
curve(L, from = 0, to = 1, main = paste("p = ", p, "mean(x) = ", mean(xi), 
    " n = ", n))

Here, the likelihood is maximized at 0.7, which is mean(x), not the true probability p = 0.8. This is because the maximum likelihood estimate comes from the actual data, not from the TRUE underlying probability, which we never know in practice.
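
As a quick check of this point, the maximum can also be found numerically instead of read off the curve. This is only a sketch: `LL1` is a helper name introduced here (not from the course code), and it uses `dbinom()` with `size = 1` for the Bernoulli log-likelihood.

```r
# Simulate Bernoulli data as in the example above
set.seed(123)
n <- 10
p <- 0.8
xi <- rbinom(n = n, size = 1, prob = p)

# Log-likelihood of a single candidate probability pp
LL1 <- function(pp) sum(dbinom(xi, size = 1, prob = pp, log = TRUE))

# optimize() minimizes by default, so ask for the maximum explicitly;
# the interval stays away from 0 and 1, where the log-likelihood can be -Inf
fit <- optimize(LL1, interval = c(0.001, 0.999), maximum = TRUE)
p_mle <- fit$maximum

# The numerical maximum should agree with the sample mean
c(numerical = p_mle, analytical = mean(xi))
```

The two numbers should match up to the tolerance of `optimize()`, and both can differ from the true `p` because they depend only on the sampled data.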

Confidence interval

How do you calculate a confidence interval? In basic statistics classes, we were told to use mean +- 1.96*SE. But this is a special case of the general approach, and it works because the normal distribution is symmetric.

The general approach works like this, using the binomial distribution as an example: we know the proportion of successes in the data, p_hat = x/n. Then we propose a value for the true probability, say prob = 0.3, and consider the binomial distribution of n trials with that probability. Under this distribution, we calculate the probability of getting an estimate as far out in the tail as p_hat. If this tail probability is less than 0.025, then the proposed prob value is not within the 95% confidence interval for the true probability behind our actual data. We repeat this procedure for other proposed values… It is probably easier to just look at the code:

# Confidence intervals for a Bernoulli process; generate data
n <- 100
p <- 0.5
xi <- rbinom(n = n, size = 1, prob = p)

# Compute estimate
p_hat <- mean(xi)

# Plot estimator
op = par(mfrow = c(1, 1), lwd = 2, bty = "l", las = 1, cex = 1.5)

lower_cum <- function(p_est, pp, n) pbinom(q = p_est * n - 1, size = n, 
    prob = pp)
upper_cum <- function(p_est, pp, n) 1 - pbinom(q = p_est * n, size = n, 
    prob = pp)

pp <- 0.3
W <- function(pp, n) cbind((0:n)/n, dbinom(x = 0:n, size = n, prob = pp))
plot(W(pp, n), type = "h", main = paste("p=", pp, "p_hat=", p_hat, "lower=", 
    0.001 * round(1000 * lower_cum(p_hat, pp, n)), "upper=", 0.001 * round(1000 * 
        upper_cum(p_hat, pp, n))), xlab = "estimate", ylab = "probability")
points(p_hat, 0, col = "red")

In this case, 0.3 is not within the 95% CI of p_hat.

# Confidence intervals for a Bernoulli process; generate data
n <- 100
p <- 0.5
xi <- rbinom(n = n, size = 1, prob = p)

# Compute estimate
p_hat <- mean(xi)

# Plot estimator
op = par(mfrow = c(1, 1), lwd = 2, bty = "l", las = 1, cex = 1.5)

lower_cum <- function(p_est, pp, n) pbinom(q = p_est * n - 1, size = n, 
    prob = pp)
upper_cum <- function(p_est, pp, n) 1 - pbinom(q = p_est * n, size = n, 
    prob = pp)

pp <- 0.6
W <- function(pp, n) cbind((0:n)/n, dbinom(x = 0:n, size = n, prob = pp))
plot(W(pp, n), type = "h", main = paste("p=", pp, "p_hat=", p_hat, "lower=", 
    0.001 * round(1000 * lower_cum(p_hat, pp, n)), "upper=", 0.001 * round(1000 * 
        upper_cum(p_hat, pp, n))), xlab = "estimate", ylab = "probability")
points(p_hat, 0, col = "red")

In this case, 0.6 is within the 95% CI. By repeating this procedure over a range of proposed values, we can get the 95% CI for p_hat.

# numerically find confidence intervals
alpha <- 0.05

toMin_lower <- function(pp) (lower_cum(p_hat, pp, n) - alpha/2)^2
toMin_upper <- function(pp) (upper_cum(p_hat, pp, n) - alpha/2)^2

upper_alpha <- optim(p_hat, toMin_lower)$par
lower_alpha <- optim(p_hat, toMin_upper)$par

par(mfrow = c(1, 2))
pp <- upper_alpha
plot(W(pp, n), type = "h", main = paste("p=", 0.001 * round(1000 * pp), 
    "lower=", 0.001 * round(1000 * lower_cum(p_hat, pp, n)), "upper=", 
    0.001 * round(1000 * upper_cum(p_hat, pp, n))), xlab = "estimate", 
    ylab = "probability")
points(p_hat, 0, col = "red")

pp <- lower_alpha
plot(W(pp, n), type = "h", main = paste("p=", 0.001 * round(1000 * pp), 
    "lower=", 0.001 * round(1000 * lower_cum(p_hat, pp, n)), "upper=", 
    0.001 * round(1000 * upper_cum(p_hat, pp, n))), xlab = "estimate", 
    ylab = "probability")
points(p_hat, 0, col = "red")

# Test confidence intervals
n <- 500
p_true <- 0.7
nexpts <- 1000
countOutside <- array(0, c(nexpts, 6))
for (expt in 1:nexpts) {
    p_hat <- (1/n) * sum(rbinom(n = n, size = 1, prob = p_true))
    if (p_hat == 0) {
        lowerbound <- 0
        lowerconverge <- 0
    } else {
        lower_alpha <- optim(p_hat, toMin_upper)
        lowerbound <- lower_alpha$par
        lowerconverge <- lower_alpha$value > 10^-4
    }
    if (p_hat == 1) {
        upperbound <- 1
        upperconverge <- 0
    } else {
        upper_alpha <- optim(p_hat, toMin_lower)
        upperbound <- upper_alpha$par
        upperconverge <- upper_alpha$value > 10^-4
    }
    
    countOutside[expt, ] <- c(p_true <= lowerbound, p_true >= upperbound, 
        lowerbound, upperbound, lowerconverge, upperconverge)
}
c(mean(countOutside[countOutside[, 5] == 0, 1]), mean(countOutside[countOutside[, 
    6] == 0, 2]))

## [1] 0.022 0.030

colMeans(countOutside)

## [1] 0.0220 0.0300 0.6597 0.7378 0.0000 0.0000

head(countOutside, n = 10)

##       [,1] [,2]   [,3]   [,4] [,5] [,6]
##  [1,]    0    0 0.6474 0.7265    0    0
##  [2,]    0    0 0.6743 0.7513    0    0
##  [3,]    0    0 0.6515 0.7303    0    0
##  [4,]    0    0 0.6619 0.7399    0    0
##  [5,]    0    0 0.6722 0.7494    0    0
##  [6,]    0    0 0.6371 0.7169    0    0
##  [7,]    0    0 0.6722 0.7494    0    0
##  [8,]    0    0 0.6392 0.7188    0    0
##  [9,]    0    0 0.6805 0.7571    0    0
## [10,]    0    0 0.6474 0.7265    0    0
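
The bound search above minimizes a squared difference with optim(). As an alternative sketch, the same two equations can be solved directly with `uniroot()`. The function name `ci_by_root` and the example counts (50 successes out of 100) are made up for illustration. The tail definitions follow `lower_cum()` and `upper_cum()` above, which leave the observed count out of both tails. The exact Clopper-Pearson interval from `binom.test()` includes the observed count in the tails, so it should come out slightly wider.

```r
# Solve for the confidence bounds with uniroot().
# x is the observed number of successes out of n trials.
ci_by_root <- function(x, n, alpha = 0.05) {
    stopifnot(x > 0, x < n)  # x = 0 and x = n need special handling
    # Same tails as lower_cum() and upper_cum(), written with the count x
    lower_tail <- function(pp) pbinom(q = x - 1, size = n, prob = pp)
    upper_tail <- function(pp) 1 - pbinom(q = x, size = n, prob = pp)
    p_est <- x/n
    # Lower bound: the pp at which the tail above x has probability alpha/2
    lower <- uniroot(function(pp) upper_tail(pp) - alpha/2,
        interval = c(1e-08, p_est), tol = 1e-10)$root
    # Upper bound: the pp at which the tail below x has probability alpha/2
    upper <- uniroot(function(pp) lower_tail(pp) - alpha/2,
        interval = c(p_est, 1 - 1e-08), tol = 1e-10)$root
    c(lower = lower, upper = upper)
}

ci <- ci_by_root(x = 50, n = 100)
ci

# Clopper-Pearson interval from base R, for comparison
cp <- as.numeric(binom.test(x = 50, n = 100)$conf.int)
cp
```

Root finding avoids the warning that optim() gives for one-dimensional Nelder-Mead problems, and it fails loudly if the interval does not bracket a solution.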

Analysis of the grouse data

Goal: Estimating the effect of WINDSPEEDSQR on observations of RUGR. There are many ways to do this:

a likelihood ratio test
linear regression with data transformation
LMM
GLM
GLMM
a parametric bootstrap test

Note: Always use quasibinomial or quasipoisson for GLMs. In a GLMM, an observation-level random effect (1|id) allows the variation to be larger than the distribution would otherwise allow, i.e., it plays a role similar to quasibinomial or quasipoisson. Like the residuals in a linear regression, it absorbs all remaining unexplained variation.
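
To see why the note matters, here is a small sketch on simulated data (not the grouse data). The covariate `x`, the coefficients, and `theta` are all made-up values. The counts are generated as beta-binomial, so a plain binomial GLM should underestimate the standard errors.

```r
set.seed(42)
n_routes <- 200
size <- 8                                 # trials (stations) per route
x <- runif(n_routes, min = 0, max = 4)    # hypothetical covariate
p_mean <- plogis(-1 - 0.3 * x)            # hypothetical true relationship
theta <- 5                                # smaller theta = more overdispersion

# Beta-binomial counts: a route-level probability, then binomial sampling
p_route <- rbeta(n_routes, shape1 = p_mean * theta, shape2 = (1 - p_mean) * theta)
obs <- rbinom(n_routes, size = size, prob = p_route)

fit_bin <- glm(cbind(obs, size - obs) ~ x, family = binomial)
fit_quasi <- glm(cbind(obs, size - obs) ~ x, family = quasibinomial)

# Dispersion estimate: Pearson chi-square divided by residual df
dispersion <- sum(residuals(fit_bin, type = "pearson")^2)/df.residual(fit_bin)
dispersion

# Same point estimates; quasibinomial SEs are inflated by sqrt(dispersion)
se_bin <- summary(fit_bin)$coefficients["x", "Std. Error"]
se_quasi <- summary(fit_quasi)$coefficients["x", "Std. Error"]
c(se_binomial = se_bin, se_quasibinomial = se_quasi)
```

A dispersion well above 1 signals that the binomial variance is too small for the data, and the quasibinomial fit then reports correspondingly wider standard errors.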

## (i) a likelihood ratio test
## Probability distribution function for a betabinomial distribution,
## modified from the library 'emdbook'
dbetabinom <- function(y, prob, size, theta, shape1, shape2, log = FALSE) {
    if (missing(prob) && !missing(shape1) && !missing(shape2)) {
        prob <- shape1/(shape1 + shape2)
        theta <- shape1 + shape2
    }
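    # Operators must end each line; otherwise R treats the following lines
    # as separate expressions and drops the lbeta() terms
    v <- lfactorial(size) - lfactorial(y) - lfactorial(size - y) -
        lbeta(theta * (1 - prob), theta * prob) +
        lbeta(size - y + theta * (1 - prob), y + theta * prob)
    nonint <- (y%%1) != 0  # flag non-integer counts (n was undefined here)
    if (any(nonint)) {
        warning("non-integer y detected; returning zero probability")
        v[nonint] <- -Inf
    }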
    if (log) 
        v else exp(v)
}

# Log-likelihood function for the betabinomial given data Y (vector of
# successes), Size (vector of number of trials), and independent
# variable X (WINDSPEEDSQR) in terms of parameters prob and theta
dbetabinom_LLF <- function(parameters, Y, Size, X) {
    theta <- parameters[1]
    b0 <- parameters[2]
    b1 <- parameters[3]
    
    # inverse logit function
    prob <- 1/(1 + exp(-b1 * (X - b0)))
    -sum(dbetabinom(y = Y, size = Size, prob = prob, theta = theta, log = TRUE))
}

LLe <- optim(fn = dbetabinom_LLF, par = c(theta = 0.5, b0 = 1, b1 = 0.5), 
    Y = w$OBS, Size = w$STATIONS, X = w$WINDSPEEDSQR, method = "BFGS")

# Compute statistical significance of H0:b1=0 (effect of WINDSPEEDSQR)

# Log-likelihood function for the betabinomial given data Y (vector of
# successes), Size (vector of number of trials), and independent
# variable X (WINDSPEEDSQR) with b1 = 0 in terms of parameters prob and
# theta
dbetabinom_LLF0 <- function(parameters, Y, Size, X) {
    theta <- parameters[1]
    b0 <- parameters[2]
    
    # inverse logit function
    prob <- 1/(1 + exp(b0))
    -sum(dbetabinom(y = Y, size = Size, prob = prob, theta = theta, log = TRUE))
}

LLe0 <- optim(fn = dbetabinom_LLF0, par = c(theta = 0.5, b0 = 1), Y = w$OBS, 
    Size = w$STATIONS, X = w$WINDSPEEDSQR, method = "BFGS")
LLe0

c(LLe$value, LLe0$value)
pchisq(2 * (LLe0$value - LLe$value), df = 1, lower.tail = FALSE)


## (ii) LMM for the presence of RUGR at stations (ignoring the binary
## nature of the data)
library(lme4)
# Make variable in d for mean WINDSPEEDSQR (to give a fair comparison
# between methods at the station vs. route levels)
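library(dplyr)
# assign the result and use the column name the models below expect
d <- d %>% group_by(ROUTE) %>% mutate(meanWINDSPEEDSQR = mean(WINDSPEEDSQR)) %>% 
    ungroup()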

lmer(RUGR ~ WINDSPEEDSQR + (1 | ROUTE), data = d)
lmer(RUGR ~ meanWINDSPEEDSQR + (1 | ROUTE), data = d)
# To get p-values, you can use Anova in library(car)
library(car)
Anova(lmer(RUGR ~ WINDSPEEDSQR + (1 | ROUTE), data = d))
Anova(lmer(RUGR ~ meanWINDSPEEDSQR + (1 | ROUTE), data = d))

## (iii) LM for the proportion of observations per route (arcsine
## square-root transformed)
w$tOBS <- asin(sqrt(w$OBS/w$STATIONS))
summary(lm(tOBS ~ WINDSPEEDSQR, data = w))

## (iv) GLM for the presence of RUGR at stations
summary(glm(RUGR ~ WINDSPEEDSQR, family = "quasibinomial", data = d))
summary(glm(RUGR ~ meanWINDSPEEDSQR, family = "quasibinomial", data = d))
# two-tailed p-value (alpha = 0.05) for a t distribution

## (v) GLM for the number of observations per route
summary(glm(cbind(OBS, STATIONS - OBS) ~ WINDSPEEDSQR, family = "binomial", 
    data = w))
summary(glm(cbind(OBS, STATIONS - OBS) ~ WINDSPEEDSQR, family = "quasibinomial", 
    data = w))

## (vi) GLMM for the presence of RUGR at stations
glmer(RUGR ~ meanWINDSPEEDSQR + (1 | ROUTE), family = "binomial", data = d)
Anova(glmer(RUGR ~ meanWINDSPEEDSQR + (1 | ROUTE), family = "binomial", 
    data = d))

id <- as.factor(1:dim(d)[1])
glmer(RUGR ~ meanWINDSPEEDSQR + (1 | ROUTE) + (1 | id), family = "binomial", 
    data = d)

## (vii) GLMM for the number of observations per route
glmer(cbind(OBS, STATIONS - OBS) ~ WINDSPEEDSQR + (1 | ROUTE), family = "binomial", 
    data = w)
## (viii) a parametric bootstrap test assuming the distribution of
## observations per route is betabinomial
library(emdbook)

# Estimated ('true') value of b1
b1_true <- LLe$par[3]
# Estimated values of b0 and theta under the H0: no effect of windspeed
theta_true0 <- LLe0$par[1]
b0_true0 <- LLe0$par[2]
# Bootstrap simulation under H0
nreps <- 2000
est_b1 <- array(0, c(nreps, 1))
for (rep in 1:nreps) {
    sim.w <- w
    for (route in levels(w$ROUTE)) {
        p <- 1/(1 + exp(b0_true0))
        sim.w$OBS[w$ROUTE == route] <- rbetabinom(n = 1, size = w$STATIONS[w$ROUTE == 
            route], p = p, theta = theta_true0)
    }
    sim.LLe <- optim(fn = dbetabinom_LLF, par = c(theta = 0.5, b0 = 1, 
        b1 = 0.5), Y = sim.w$OBS, Size = sim.w$STATIONS, X = sim.w$WINDSPEEDSQR, 
        method = "BFGS")
    est_b1[rep] <- sim.LLe$par[3]
}


# Histogram of bootstrap distribution of the estimator of b1
hist(est_b1)
abline(v = b1_true, col = "red")
lines(c(b1_true, b1_true), c(0, nreps), col = "red")

# P-values
pvalue.onetailed <- mean(est_b1 < b1_true)
pvalue.onetailed
pvalue.twotailed <- 2 * pvalue.onetailed
pvalue.twotailed

Hemlock data

For a grouping variable, if the data in each group cover only a small range of values (e.g., group 1 has values from 10-20, group 2 has 20-30, etc.), then it is not a good idea to analyze at the group level. Instead, we should combine all groups and analyze them together. On the other hand, if each group covers a wide range of values, then it should be fine to analyze at the group level.

A simple R function to compress pictures

Sun, 30 Nov 2014 00:00:00 +0000

I do not know too much about picture compression. There must be better ways/packages to do this. This small project is just for fun.

First, here is a function to blur a picture. It will use the mean value of all cells in a submatrix as value for each cell of that submatrix.

library(png)
library(parallel)

filter.img = function(mat, k = 1) {
    pad.mat <- matrix(0, dim(mat)[1] + 2 * k, dim(mat)[2] + 2 * k)
    pad.mat[(k + 1):(dim(mat)[1] + k), (k + 1):(dim(mat)[2] + k)] = mat
    pad.mat2 = matrix(0, dim(pad.mat)[1], dim(pad.mat)[2])
    for (i in (k + 1):(dim(mat)[1] + k)) {
        for (j in (k + 1):(dim(mat)[2] + k)) {
            pad.mat2[i, j] = mean(pad.mat[(i - k):(i + k), (j - k):(j + k)])
        }
    }
    pad.mat2[(k + 1):(dim(pad.mat2)[1] - k), (k + 1):(dim(pad.mat2)[2] - k)]
}
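
To see what this kind of mean filter does, here is a small sketch on a toy 3 by 3 matrix. `box_blur` is a name introduced here; it follows the same zero-padding logic as filter.img() but is restated so the example runs on its own.

```r
# Mean filter with zero padding, same logic as filter.img() above
box_blur <- function(mat, k = 1) {
    nr <- nrow(mat)
    nc <- ncol(mat)
    padded <- matrix(0, nr + 2 * k, nc + 2 * k)
    padded[(k + 1):(nr + k), (k + 1):(nc + k)] <- mat
    out <- matrix(0, nr, nc)
    for (i in seq_len(nr)) {
        for (j in seq_len(nc)) {
            # window of side 2k + 1 centered on cell (i, j) of the original
            out[i, j] <- mean(padded[i:(i + 2 * k), j:(j + 2 * k)])
        }
    }
    out
}

toy <- matrix(1:9, nrow = 3)
blurred <- box_blur(toy, k = 1)
blurred
# center cell: mean of all nine values = 5
# corner cell [1, 1]: (1 + 2 + 4 + 5) / 9, because five padded zeros are averaged in
```

The corner value shows a side effect of zero padding: cells near the border are averaged with zeros, so the edges of a blurred picture come out darker, more so for larger k.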

Next, let’s read in the picture below and separate its red, green, and blue arrays.

# read picture and get red, green, blue arrays
str(vg <- readPNG("Van_Gogh_Wheatfield_with_Crows.png"))
red.vg <- vg[, , 1]
green.vg <- vg[, , 2]
blue.vg <- vg[, , 3]
filter.vg = list(red.vg, green.vg, blue.vg)

Then here is the function that does the compression on each array and combines them.

# blur red, green, and blue, then combine them.
final.png = function(lst = filter.vg, k = 1) {
    out.filter.vg = mclapply(lst, function(x) filter.img(x, k = k), mc.cores = 3)
    out.array = array(unlist(out.filter.vg), dim = c(dim(lst[[1]])[1], dim(lst[[1]])[2], 
        3))
    writePNG(out.array, target = paste("dli55_", k, ".png", sep = ""))
}

Ok, let’s try different extents of compression.

final.png(k = 1)

final.png(k = 3)

final.png(k = 5)

Maximum likelihood estimation of normal distribution

Wed, 08 Oct 2014 00:00:00 +0000

The probability density function of normal distribution is: \[ f(x)=\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}} \]

Suppose we have the following n i.i.d. observations: $x_{1},x_{2},\dots,x_{n}$. Because they are independent, the joint density of the observed data is: \[ f(x_{1},x_{2},\dots,x_{n}|\sigma,\mu)=\prod_{i=1}^{n}\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x_{i}-\mu)^{2}}{2\sigma^{2}}}=(\frac{1}{\sigma\sqrt{2\pi}})^{n}e^{-\frac{1}{2\sigma^{2}}\sum_{i=1}^{n}(x_{i}-\mu)^{2}} \]

\[\begin{array}{cl} \log(f(x_{1},x_{2},\dots,x_{n}|\sigma,\mu)) & =\log((\frac{1}{\sigma\sqrt{2\pi}})^{n}e^{-\frac{1}{2\sigma^{2}}\sum_{i=1}^{n}(x_{i}-\mu)^{2}})\\ & =n\log\frac{1}{\sigma\sqrt{2\pi}}-\frac{1}{2\sigma^{2}}\sum_{i=1}^{n}(x_{i}-\mu)^{2}\\ & =-\frac{n}{2}\log(2\pi)-n\log\sigma-\frac{1}{2\sigma^{2}}\sum_{i=1}^{n}(x_{i}-\mu)^{2} \end{array}\]

Let’s call $\log(f(x_{1},x_{2},\dots,x_{n}|\sigma,\mu))$ $\mathcal{L}$, then set its derivative with respect to $\mu$ to zero: \[ \frac{d\mathcal{L}}{d\mu}=-\frac{1}{2\sigma^{2}}\frac{d}{d\mu}\sum_{i=1}^{n}(x_{i}-\mu)^{2}=0 \] Solving this equation, we get \[ \frac{1}{2\sigma^{2}}\sum_{i=1}^{n}(2\hat{\mu}-2x_{i})=0 \]

Because $\sigma^{2}$ should be larger than zero, \[ \hat{\mu}=\frac{\sum_{i=1}^{n}x_{i}}{n} \]

Similarly, let \[ \frac{d\mathcal{L}}{d\sigma}=-\frac{n}{\sigma}+\sum_{i=1}^{n}(x_{i}-\mu)^{2}\sigma^{-3}=0 \]

I realized that it would be better to get the maximum likelihood estimator of $\sigma^{2}$ instead of $\sigma$. Thus

\[ \hat{\sigma}^{2}=\frac{\sum_{i=1}^{n}(x_{i}-\hat{\mu})^{2}}{n} \]

But this MLE of $\sigma^{2}$ is biased. A point estimator $\hat{\theta}$ is said to be an unbiased estimator of $\theta$ if $E(\hat{\theta})=\theta$ for every possible value of $\theta$. If $\hat{\theta}$ is not unbiased, the difference $E(\hat{\theta})-\theta$ is called the bias of $\hat{\theta}$.

We know that \[ \sigma^{2}=Var(X)=E(X^{2})-(E(X))^{2}\Rightarrow E(X^{2})=Var(X)+(E(X))^{2} \]

Then \[ \begin{array}{cl} E(\hat{\sigma}^{2}) & =\frac{1}{n}E(\sum_{i=1}^{n}(x_{i}-\hat{\mu})^{2})\\ & =\frac{1}{n}E(\sum x_{i}^{2}-n\hat{\mu}^{2})\\ & =\frac{1}{n}E(\sum x_{i}^{2}-\frac{(\sum x_{i})^{2}}{n})\\ & =\frac{1}{n}\left\{ \sum E(x_{i}^{2})-\frac{1}{n}E\left[(\sum x_{i})^{2}\right]\right\} \\ & =\frac{1}{n}\left\{ \sum(\sigma^{2}+\mu^{2})-\frac{1}{n}\left[n\sigma^{2}+(n\mu)^{2}\right]\right\} \\ & =\frac{1}{n}\left\{ n\sigma^{2}+n\mu^{2}-\sigma^{2}-n\mu^{2}\right\} \\ & =\frac{n-1}{n}\sigma^{2}\\ & \neq\sigma^{2} \end{array} \]

The bias is $E(\hat{\sigma}^{2})-\sigma^{2}=-\frac{\sigma^{2}}{n}$. In fact, the unbiased estimator of $\sigma^{2}$ is $s^{2}=\frac{\sum_{i=1}^{n}(x_{i}-\hat{\mu})^{2}}{n-1}$. But the fact that $s^{2}$ is unbiased does not imply that $s$ is unbiased for estimating $\sigma$: the expected value of the square root is not the square root of the expected value. Fortunately, the bias of $s$ is small unless the sample size is very small. Thus there are good reasons to use $s$ as an estimator of $\sigma$.
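
The bias result can be checked with a quick simulation. This is a sketch with arbitrary values (n = 5, mu = 10, sigma = 2), chosen only so that the bias is easy to see.

```r
set.seed(2014)
n <- 5          # small sample size, so the bias is visible
mu <- 10
sigma <- 2
nsim <- 1e+05   # number of simulated samples

# Each row is one sample of size n
x <- matrix(rnorm(n * nsim, mean = mu, sd = sigma), nrow = nsim)
xbar <- rowMeans(x)
ss <- rowSums((x - xbar)^2)   # sum of squared deviations per sample

sigma2_mle <- ss/n            # maximum likelihood estimator
s2 <- ss/(n - 1)              # unbiased estimator

c(mean_mle = mean(sigma2_mle), expected_mle = (n - 1)/n * sigma^2,
    mean_s2 = mean(s2), true_sigma2 = sigma^2)
```

The average of the MLE should be close to $\frac{n-1}{n}\sigma^{2}=3.2$, while the average of $s^{2}$ should be close to $\sigma^{2}=4$.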

ESA 2014 -- day 2

Tue, 12 Aug 2014 00:00:00 +0000

ESA is in full swing now on its second day.

I went to two ignite sessions about tools and tips for working with ecological data, though I did not stay for the whole of either session. Amber Budden introduced and walked through DataONE, which is a public data repository. Carly Strasser then gave us four tips about making high quality data. You can download her slides here. There were also some other cool talks that I missed, especially the one given by Karthik Ram about the rOpenSci project.

David Storch gave a fantastic talk about the relationship between species richness and the number of individuals. Based on the energy hypothesis, more available energy leads to more NPP, thus more individual stems, and in the end more species. However, David et al. found that variation in the number of individuals is not responsible for variation in species richness, though the number of individuals may contribute to the regulation of species richness. However, this probably depends on spatial scale and clade.

Vigdis Vandvik’s lab did some really cool plant transplantation experiments. In alpine grasslands of Norway, they transplanted turfs to warmer and wetter locations. They also transplanted turfs within the same site to serve as controls. Then they looked at species composition and community abundance-weighted trait values for the turfs, the controls, and their neighboring quadrats. They found that precipitation does not matter much in their system, even though precipitation varied from 600 to 3000 mm. Instead, temperature explained most of the variation. It is also interesting that the changes in species composition, as well as in two traits (leaf area and seed mass), did not differ from what is expected by chance. However, SLA and maximum height did differ significantly from the random expectation. The larger the initial dissimilarity between a transplanted turf and the site it was transplanted to, the larger the change in the functional traits of the plants in that turf. So, I guess that filtering effects are really strong here!

Peter Adler analyzed the coexistence of five perennial plants using Chesson 2000’s framework and long-term demographic data. They found that stabilizing niche differences among these five species were large but fitness differences were small. They also found that the recruitment niche is the most important factor in their model. Intraspecific competition in these systems is stronger than interspecific competition.

Tadashi Fukami found priority effects between yeast and bacteria in flower nectar. Pollinators do not like bacteria but like yeast in nectar. Yeast have negative effects on bacteria, which in turn have negative effects on plant-pollinator relationships (i.e., more bacteria, less pollinator visitation). Thus, yeast have positive effects on plant-pollinator relationships! In order to understand the effects of nectar microbes on plant-pollinator relationships, we need to study microbial community assembly! Neat story. I had not thought much about community assembly at this scale, but it is much easier to conduct manipulation experiments at this scale!

It seems that you sorted out all things and you do not need my help.

This sentence really made my day. Thanks, Nick.

ESA 2014 -- day 1

Mon, 11 Aug 2014 00:00:00 +0000

My first ever ESA meeting began today! Yes, I am excited, but also overwhelmed! Since I got to Sacramento very late last night (at 11pm local time, which was 1am in the Midwest, where I came from), I did not get up early enough this morning. As a result, I missed the awards session and the talk by keynote speaker Kathy Cottingham.

As a first-year ESA attendee, I was really glad that I went to the orientation for student attendees organized by Kika Tarsi. We got a general idea of how the meeting is structured and how to get the most out of an ESA meeting. We also got lots of practice with the “elevator pitch”. It was a really helpful session, but it would be better if the organizer gave attendees a little more time when practicing.

After three years, I finally met Elizabeth Borer! I applied to Dr. Borer’s lab three years ago for my Ph.D. but did not get in. Before our meeting, I actually did not have a good idea of what we were going to talk about. But it turned out to be a very, very nice conversation! I showed her a little bit of the projects I am working on (yes, I am a nerd…), then talked about my plans after graduation and got some really helpful suggestions. We also talked a little bit about where the Nutrient Network came from and how it evolved! Among all the other useful suggestions and tips, there was one tip about meetings from Dr. Borer: mainly use this kind of big meeting as an opportunity to meet the people you want to meet, as well as to listen to a couple of good talks each day.

After talking with Dr. Borer for more than one hour (!), I went to a talk given by Marko J. Spasojevic about intraspecific functional traits. Intraspecific traits have received more and more attention in trait-based community studies over the past five years. Marko and Jonathan Myers are working on a 20-hectare plot at the Tyson Research Center at Washington University in St. Louis. They divided the area into 20m by 20m grid cells. When they analyzed the data at this scale, including intraspecific traits did not increase the variation the model could explain. Then they analyzed the data at 40m by 40m and 60m by 60m scales. With bigger grid cells, beta diversity among cells got lower, i.e. cells shared more species. But then including intraspecific traits improved the model. Interesting results, but why? Marko did not say much about the reasons for this phenomenon. Is it because, at smaller scales, individuals of each species have very similar traits (thus not much intraspecific variation)? With increasing spatial scale, species likely become more variable because of environmental gradients, so intraspecific traits become more important for explaining community assembly patterns? If so, then what is the threshold? Is this system specific?

I then ran into an ignite session about climate change. It is super cool to have this kind of talk: about 5 minutes for each speaker and no Q&A at the end. With such limited time, people mostly focus on the big picture. It is nice to listen to a couple of these. But after three of them, I did not feel that I had learned the things I wanted to learn (do not get me wrong, this session’s goal is actually to inspire ideas).

In the exhibit hall, I came across one of my previous labmates from China. It was very nice to catch up with her in the US! Near the end of the poster session, I happened to meet Ben Bolker! But he was in a hurry to get to the Theoretical Ecology Section mixer. Hopefully I will get some more time to talk with him later.

I had never been to an ESA student mixer before! It was fun to talk with fellow students in the bar, get to know some new people, and tell Kika that she did not introduce herself in the student orientation session! The most interesting thing is that I met a sophomore undergraduate at the student mixer! I am sorry, are people really attending ESA meetings so young??? I suddenly felt that I was so late/old…

Well, I guess I can call it a good day now! Tomorrow, I am going to meet with Nicholas J. Gotelli. I am really looking forward to it!

P.S. Two thoughts about the ESA meeting so far:

Why are there no blank pages at the end of the program book? Sometimes I do not have a notebook with me and I want to take notes!
The name tag should be printed on both sides so I do not need to worry about whether people can see my name, though it is easy to just handwrite my name on the other side.

First experience with CHTCondor

Sat, 21 Jun 2014 00:00:00 +0000

I am doing lots of random subsampling of a big quadrat-by-species matrix and then running some analysis on each randomly subsampled matrix. For example, we have a big dataset that includes vegetation survey data for three vegetation types and two time periods for each vegetation type. For each vegetation type and time period, in order to standardize sampling effort, I need to subsample 28 sites with 20 quadrats per site, 5000 times. This gives me 5000 * 6 = 30000 quadrat-by-species matrices. For each of these matrices (560 rows and 100-300 columns), I need to do some analysis that takes about 5-10 minutes (it involves null models: reshuffling each matrix another 5000 times to get an effect size and p-value). I am doing all of this in R. As R is limited by RAM, this job is far, far more than what I can handle with my laptop or desktop (6 GB RAM with 4 cores).
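To get a feel for the size of this job, here is a back-of-the-envelope calculation using only the numbers above. The variable names are mine, and the run time assumes the matrices are processed one after another on a single core.

```shell
#!/bin/sh
# Rough workload estimate from the numbers in the paragraph above.
veg_types=3          # vegetation types
periods=2            # time periods per vegetation type
n_subsamples=5000    # random subsamples per type-period combination
sites=28             # sites per subsample
quadrats=20          # quadrats per site

n_matrices=$((veg_types * periods * n_subsamples))   # total matrices to analyze
n_rows=$((sites * quadrats))                         # rows in each matrix

# Serial run time in hours at 5 and at 10 minutes per matrix
hours_low=$((n_matrices * 5 / 60))
hours_high=$((n_matrices * 10 / 60))

echo "matrices: $n_matrices, rows per matrix: $n_rows"
echo "serial run time: $hours_low to $hours_high hours"
```

That is 2500 to 5000 hours, or roughly 104 to 208 days if run one matrix at a time.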

Fortunately, we have a center for high throughput computing (HTC) on campus; specifically, the HTCondor Project in my case. The idea is to split the job into n small jobs and send them to n computers (either campus-wide or nationwide). Each computer then finishes its small job and sends the results back. Since I am using R for all analyses, I also need to ship R and the extra packages I use, along with data and code, to these n computers. As a result, we need to follow two steps to make it work. However, the help page is not documented very well. The computing facilitator is very nice, though, and at the first meeting s/he will probably teach you how to use the system. But if, like me, you have no previous experience with ssh, you will probably feel overwhelmed the first time. Here I record my experience of using R with HTCondor, step by step.

Before doing anything with HTCondor, you need to get access to it. Talk with your computing facilitator to set up an account for you.

Building R on the submit node. Following the help page will be fine here. What bothered me was: what/where is a submit node? Where should I run the code to build R? I am still not quite sure where the submit node is. Is it the home directory ~ or the directory where my data and actual analysis code are located? I just ran the R building code in my home directory.
1. First, download the source code of the R packages you need for your analysis (if any, e.g. vegan_2.0-10.tar.gz). Then transfer them to your home directory: `scp *.tar.gz user@submit-3.chtc.wisc.edu:.` Here `*` matches anything, and the `.` at the end means keep the names as is.
2. Then log in to your account and build R: `chtc_buildRlibs --rversion=sl5-R-3.0.1 --rlibs=permute_0.8-3.tar.gz,vegan_2.0-10.tar.gz`. The order of packages matters, as the help page says: libraries that are called by other libraries should be listed first.
Then we move on to step two. Again, just follow the help page.
1. Make a directory to hold everything for your project and set it as the working directory.
```
mkdir project-name
cd project-name
```
2. Download the ChtcRun package and unpack it
```
wget http://chtc.cs.wisc.edu/downloads/ChtcRun.tar.gz
tar xzf ChtcRun.tar.gz
```
3. Transfer your data and files into directories within the ChtcRun directory, following the help page. Use the Rin directory that comes with the ChtcRun package as an example. Put the R analysis code for each job in the Rin/shared directory. Also, copy the two compressed files produced in step one into this folder!!! This point is not on the help page! The two files, in my case, are sl5-RLIBS.tar.gz and sl6-RLIBS.tar.gz. So within the Rin directory, copy them using `cp ~/sl*-RLIBS.tar.gz shared/.` Then, within the Rin directory, create one folder for each job; just following the help page will be fine.
4. Within the ChtcRun directory, submit your jobs. Here is an example: suppose all of your data and files are in the Rin directory and you want your output to be in the Rout directory. The R code is code.R in the Rin/shared folder. For each job, you will get three files back, say a.csv, b.csv, and c.csv, and you want all your result files in an Rresult folder. Then you can run the following:
```
./mkdag --data=Rin --outputdir=Rout --resultdir=Rresult --cmdtorun=code.R \
--pattern=a.csv --pattern=b.csv --pattern=c.csv \
--type=R --version=R-3.0.1
cd Rout
condor_submit_dag mydag.dag
```

That is it! Now that you have submitted your jobs, you will want to check their progress. You can use `condor_q $USER` to check how many jobs are running and how many are in the queue. You can also use `less mydag.dag.dagman.out` in the Rout directory (then press space to view the file page by page, or press G to go to the end of the file). It will tell you how many jobs are done, how many are running, etc.
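Another rough way to check progress is to count how many job folders already contain a result file. The sketch below is only an illustration: the function name `count_done` is made up, and it assumes each job has its own subfolder directly under the output directory and that a finished job leaves a file such as a.csv there. I have not checked this against how ChtcRun actually lays out Rout, so adjust it to what you see on disk.

```shell
#!/bin/sh
# count_done OUT_DIR RESULT_FILE   (hypothetical helper, not part of ChtcRun)
# Prints "<done> <total>": the number of subfolders of OUT_DIR that already
# contain RESULT_FILE, followed by the total number of subfolders.
# Assumption: one subfolder per job directly under OUT_DIR.
count_done() {
    out_dir=$1
    result_file=$2
    n_done=0
    n_total=0
    for d in "$out_dir"/*/; do
        [ -d "$d" ] || continue          # nothing matched the glob
        n_total=$((n_total + 1))
        if [ -f "$d$result_file" ]; then # $d already ends with a slash
            n_done=$((n_done + 1))
        fi
    done
    echo "$n_done $n_total"
}
```

For example, `count_done Rout a.csv` run from the ChtcRun directory prints two numbers: folders with a.csv, then all folders. Any subfolder that is not a job would be counted in the total as well.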

At the end, I want to list some important points:

On your local computer, create one folder for each job and put the data for that job in it. In my case, I needed to create 30000 folders and put one sub-matrix in each. Then transfer all the folders to your account, into `~/project1/ChtcRun/Rin/`.
In the Rin/shared directory, put the R packages built in step one. This is because when HTCondor sends your jobs to different computers, each computer needs your data, code, and R packages.
You can use --pattern= to identify your result files and collect them in one directory.
You need to know some basic shell commands.
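Creating thousands of job folders by hand is not realistic, so here is a minimal sketch of how the first point could be scripted. Everything named here is my own assumption for illustration: the function `make_job_dirs`, the `job_N` folder names, the sub-matrices being saved as CSV files in one source folder, and each copy being renamed to `input.csv` so that a shared R script could read a fixed file name. Check the help page for what ChtcRun actually expects.

```shell
#!/bin/sh
# make_job_dirs SRC_DIR RIN_DIR   (hypothetical helper, not part of ChtcRun)
# Creates RIN_DIR/shared plus one numbered job folder per CSV file found in
# SRC_DIR, and copies each CSV into its own job folder as input.csv.
# Prints the number of job folders created.
make_job_dirs() {
    src_dir=$1
    rin_dir=$2
    mkdir -p "$rin_dir/shared"           # holds code.R and the sl*-RLIBS files
    i=0
    for f in "$src_dir"/*.csv; do
        [ -e "$f" ] || continue          # no CSV files matched the glob
        i=$((i + 1))
        job_dir="$rin_dir/job_$i"        # folder naming is an assumption
        mkdir -p "$job_dir"
        cp "$f" "$job_dir/input.csv"     # fixed name is an assumption
    done
    echo "$i"
}
```

For example, `make_job_dirs submatrices Rin` builds the folder tree locally, and it can then be transferred to the server in one go, e.g. with `scp -r`.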
