
Notes on Data-driven Ecological Synthesis

I attended the excellent data-driven ecological synthesis summer school at the Station de Biologie des Laurentides (SBL) of the Université de Montréal, organized and taught by Timothée Poisot and Dominique Gravel. The station is one of the best research stations I have ever been to: great views, friendly staff, and excellent food! The teachers were very approachable and knowledgeable. My classmates were very nice to each other and we had lots of fun together.


Thanks all for a great week.

Here are my very brief notes from this one-week class.


  • What is data? Observations of variables, each with a value and a unit.

    • metadata: when, who, how, why, intellectual property?
  • Data plan? (NSF-funded DataONE; data life cycle: https://www.dataone.org/data-life-cycle) (we talked for about 50 mins)

    1. collect
    2. assure: quality control
    3. describe: metadata
    4. preserve: back up; ask your university's computing center; figshare, etc. Be careful with Dropbox if you have government data. Think about long-term archiving and who can have access.
    5. discover: identify the data you need, which were not necessarily collected by yourself.
    6. integrate: combine data from different temporal/spatial scales
    7. analyze: overview of the data analyses to conduct.
  • exercise: in groups of 2-3 people, read a paper of your own choosing and discuss 2-3 steps of the data life cycle: how did the authors handle them? What was weak? What was good? 20-30 mins.

  • Be serious about data archiving/integration when applying for funding or reviewing grants.

  • Ten Simple Rules for Creating a Good Data Management Plan

  • Ten Simple Rules for Digital Data Storage

  • Spreadsheet: flat files

    • typing SEP3, sept03, or sep03: Excel turns each of them into 3-Sep or 9/3/2017. Even if you save as a csv file at the end, they are all 3-Sep, not what you typed in.
    • tidy data: every column is a variable, every row is an observation.
    • DON'Ts: no merged cells, no color, no blank cells (be explicit about missing data and other possible issues that will result in missing data), no multiple tables in a single sheet.
    • dates: YYYY-MM-DD-HH-MM-SS-TZ or split into date, time, and time zone.
    • Location coordinates: be explicit about the format.
  • Template: use a template to enter data from the beginning of a project; when explaining the variables, be explicit about the possible values or the rules for recording them. For example: how to name a site; use Latin names for species; the format of dates; etc.

  • Exercise: everyone creates a template for their own projects. 30 mins.
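The spreadsheet rules above (one variable per column, no blank cells, dates as YYYY-MM-DD) can be checked automatically. Here is a minimal sketch in Python using only the standard library. The column names (`site`, `species`, `date`, `abundance`), the `NA` missing-value marker, and the example rows are hypothetical, made up only for illustration; they are not from the class.

```python
import csv
import io
import re
from datetime import datetime

# Hypothetical template: one variable per column, one observation per row.
REQUIRED_COLUMNS = ["site", "species", "date", "abundance"]
MISSING = "NA"  # assumed explicit marker for missing values
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def check_rows(csv_text):
    """Return a list of (line_number, message) problems found in the table."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != REQUIRED_COLUMNS:
        return [(1, "unexpected header: %r" % (reader.fieldnames,))]
    for line_number, row in enumerate(reader, start=2):
        for column in REQUIRED_COLUMNS:
            value = row[column]
            # None means the row was too short; "" means the cell was left blank.
            if value is None or value.strip() == "":
                problems.append((line_number, "blank cell in %s" % column))
        raw_date = row["date"]
        if raw_date and raw_date != MISSING:
            try:
                if not ISO_DATE.fullmatch(raw_date):
                    raise ValueError(raw_date)
                datetime.strptime(raw_date, "%Y-%m-%d")  # rejects e.g. 2017-02-30
            except ValueError:
                problems.append((line_number, "date not YYYY-MM-DD: %s" % raw_date))
    return problems


example = """site,species,date,abundance
SBL-01,Perca flavescens,2017-09-03,4
SBL-02,Perca flavescens,sep03,
"""
problems = check_rows(example)
for line_number, message in problems:
    print(line_number, message)
```

Running a check like this before sharing a file catches the `sep03` kind of entry while the person who typed it still remembers what was meant.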


  • OpenRefine (morning)

    • explore different datasets: facets, filtering of rows, transforming cells, exploring scatter plots. An example cell transform that joins the current cell with the "mo" and "dy" columns: [value, cells["mo"].value, cells["dy"].value].join("-")
    • import datasets from multiple URLs.
      • for JSON files, select “rows” instead of “records” to make life easier.
  • Jupyter notebook + R (afternoon)
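For comparison, the OpenRefine transform above (joining the current cell with the `mo` and `dy` columns) can be sketched in Python. This is only an illustration: the `yr` column name and the sample rows are assumptions, and the zero-padding is an addition that the original expression does not perform.

```python
def join_date(row, year_col="yr", month_col="mo", day_col="dy"):
    """Join the year, month and day cells with '-', as the OpenRefine
    expression does, and zero-pad them so the result is YYYY-MM-DD."""
    year = int(row[year_col])
    month = int(row[month_col])
    day = int(row[day_col])
    return "%04d-%02d-%02d" % (year, month, day)


# Hypothetical rows; "yr" is an assumed name for the column being transformed.
rows = [
    {"yr": "1977", "mo": "7", "dy": "16"},
    {"yr": "1977", "mo": "10", "dy": "3"},
]
dates = [join_date(row) for row in rows]
print(dates)  # ['1977-07-16', '1977-10-03']
```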


  • Morning
    • Group discussion about mandatory data sharing/open data (for and against, 2 groups, 45 mins)
      • debate.
      • For: drives towards a better science system (system > individuals)
      • Against: unfair (synthesizers vs. data collectors)
    • Data sets and APIs (request/URL and response/JSON object); rOpenSci project/packages.
  • Afternoon
    • Discussion about possible projects till 3pm
    • Dom gave a talk about what public data can do (Beyond the checklist: the biogeography of ecological interaction networks).
      • biogeography: the spatial and temporal distribution of species and their abundance, including causes and consequences.
      • the dominant conceptual tool in biogeography: the niche.
      • Is resource availability constant across gradients?
      • Is predation pressure constant across gradients?
      • how do interaction strength and population abundance covary?
      • what about highly diverse communities?
      • A community is more than a checklist
      • how do we move from a regional meta web to a local web?
      • revise biogeography by including species interactions
      • Gravel et al 2011 Ecol. Lett.
      • OBIS: marine occurrence data set.
      • fishbase: fish characteristics.
      • connectance is very high in global marine fish networks
      • how do you control for data quality? With huge datasets, the impact of errors may not be too problematic. More importantly, with a complex pipeline of scripts, be careful about possible programming errors: use defensive programming.
      • be careful about the sensitivity of data analyses to data quality.
    • talk about designing databases.
      • be defensive in the design: for example, restrict the types of possible inputs (characters, small integers, etc.) for error control. API design (JavaScript); advantages of an API: security, portability, remote working.
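The point about defensive database design can be illustrated with a small sketch using Python's built-in `sqlite3` module. The table and column names (`observation`, `site`, `species`, `abundance`, `obs_date`) and the sample values are hypothetical; the class did not necessarily use SQLite or this schema. The idea is simply that the database itself rejects inputs of the wrong type or format, so bad values never reach the analysis scripts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: every column declares what it is allowed to contain.
conn.execute(
    """
    CREATE TABLE observation (
        id        INTEGER PRIMARY KEY,
        site      TEXT    NOT NULL CHECK (length(site) > 0),
        species   TEXT    NOT NULL,
        abundance INTEGER NOT NULL
                  CHECK (typeof(abundance) = 'integer' AND abundance >= 0),
        obs_date  TEXT    NOT NULL
                  CHECK (obs_date GLOB '[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]')
    )
    """
)


def try_insert(site, species, abundance, obs_date):
    """Insert one row; return True on success, False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO observation (site, species, abundance, obs_date) "
                "VALUES (?, ?, ?, ?)",
                (site, species, abundance, obs_date),
            )
        return True
    except sqlite3.IntegrityError:
        return False


results = [
    try_insert("SBL-01", "Perca flavescens", 4, "2017-09-03"),   # valid row
    try_insert("SBL-01", "Perca flavescens", -1, "2017-09-03"),  # negative abundance
    try_insert("SBL-01", "Perca flavescens", 4, "sep03"),        # malformed date
    try_insert("SBL-01", None, 4, "2017-09-03"),                 # missing species
]
stored = conn.execute("SELECT COUNT(*) FROM observation").fetchone()[0]
print(results, stored)  # [True, False, False, False] 1
```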


  • Morning

    • Books suggested by Dom Gravel
      • An Illustrated Guide to Theoretical Ecology by Ted J. Case
      • The Theoretical Biologist’s Toolbox: Quantitative Methods for Ecology and Evolutionary Biology by Marc Mangel
    • Relational databases

      • advantages: efficiency, security, removal of redundancy, faster queries, allows multiple users to work on the dataset at the same time
      • SQL: Structured Query Language

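        SELECT sphote AS host, sppar AS parasite, COUNT(sppar) AS number, AVG(a) AS a
        FROM morphometry
        -- filter on the real column name; string literals take single quotes
        WHERE sphote = 'Disa'
        GROUP BY sphote, sppar
        -- keep only host/parasite pairs with more than 3 records
        HAVING COUNT(sppar) > 3
        ORDER BY number DESC
        LIMIT 4;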
      • Data Carpentry's SQL lesson for ecologists

  • Afternoon

    • Briefing on the projects to work on (4 projects; I worked on my own project).
    • Work on projects
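To try the morning's SQL query without setting up a database server, here is a minimal sketch using Python's built-in `sqlite3` module. The table and column names (`morphometry`, `sphote`, `sppar`, `a`) come from the query in the notes; the rows are invented purely for illustration, and the filter is written against the real column `sphote` with a single-quoted string so that it is portable across database engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE morphometry (sphote TEXT, sppar TEXT, a REAL)")

# Invented rows: host 'Disa' has 4 records of parasite 'P1' and 2 of 'P2';
# another host is included to show that the WHERE clause filters it out.
rows = (
    [("Disa", "P1", value) for value in (1.0, 2.0, 3.0, 4.0)]
    + [("Disa", "P2", 5.0)] * 2
    + [("Other", "P1", 1.0)] * 5
)
conn.executemany("INSERT INTO morphometry VALUES (?, ?, ?)", rows)

query = """
    SELECT sphote AS host, sppar AS parasite, COUNT(sppar) AS number, AVG(a) AS a
    FROM morphometry
    WHERE sphote = 'Disa'
    GROUP BY sphote, sppar
    HAVING COUNT(sppar) > 3
    ORDER BY number DESC
    LIMIT 4
"""
result = conn.execute(query).fetchall()
print(result)  # [('Disa', 'P1', 4, 2.5)]
```

Only the `Disa`/`P1` pair survives: `WHERE` drops the other host before grouping, and `HAVING` then drops the `P2` group because it has only 2 records.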



  • Work on project the whole day.


  • Morning
    • Work on project; started group presentations at 10:30am, till 12pm.
  • Afternoon
    • Back to Montreal at 3:30pm.