flowchart LR
A("<s>Data Collection</s>") --> B("<s>Data Cleaning</s>")
B --> C(Data Analysis)
C --> D(Data Visualization)
D --> E(Data Communication)
The goal for this week is to build a foundation that allows us to jump into data analysis in the coming weeks.
After completing this unit, you will be able to:
Use git commit and git push.
Copy (fork) other users’ files on GitHub.com.
Use the git clone command to download a copy of a GitHub repository to your computer.

We can describe the data analysis workflow as a linear sequence of steps (Figure 1):
Figure 1: Environmental Data Analysis Flowchart; Credit: NumberAnalytics
However, this workflow is embedded in the larger context of doing open science.
Open and reproducible science is a collection of practices (Figure 2) that allow us to easily share, work, and collaborate with others.
Benefits of openness and reproducibility in science include:
Hey ISAT 420 Students,
I found this cool dataset online and I want you to check this out. The data is attached to this email.
Best wishes, TG
Reflection Questions:
data.csv

It is impossible to know from the CSV file that I shared with you what the data represents.
The dataset I shared is the global carbon budget for the year 2023 (Figure 3), which is compiled by the Global Carbon Project:
This dataset is actually fairly well described and documented, both on the project website and in an article in the open-access journal Earth System Science Data.
OK, we now know where the data comes from, what it represents, and that it is likely trustworthy.
So let’s think about what we can do with this. Considering our data analysis workflow (Figure 4):
flowchart LR
A("<s>Data Collection</s>") --> B("<s>Data Cleaning</s>")
B --> C(Data Analysis)
C --> D(Data Visualization)
D --> E(Data Communication)
How can we do this in a transparent and reproducible way?
We can use a Jupyter Notebook to read, process, and visualize our data. The simplest code for our analysis might look something like this:
Step 1: Read the dataset
Step 2: Make some calculations to process our data
Step 3: Make a visualization of our data
Step 4: Save the plot as a file
The code is found on GitHub: https://github.com/ISAT-DrG/ISAT_420_S26_Shared
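The actual course code lives in the repository above. As a rough illustration of the four steps only, here is a minimal sketch. It assumes a CSV file with columns named `year`, `fossil_emissions`, and `land_use_emissions`; these column names, the function names, and the choice of a line plot are hypothetical and are not taken from the shared dataset or the course notebook.

```python
import csv
from pathlib import Path


def read_dataset(path):
    """Step 1: read the CSV file into a list of row dictionaries."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))


def total_emissions(rows):
    """Step 2: add the two (hypothetical) emission columns for each year."""
    years = [int(row["year"]) for row in rows]
    totals = [
        float(row["fossil_emissions"]) + float(row["land_use_emissions"])
        for row in rows
    ]
    return years, totals


def plot_and_save(years, totals, out_path):
    """Steps 3 and 4: draw a line plot and save it as an image file."""
    # Imported here so that Steps 1 and 2 also run where matplotlib
    # is not installed.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.plot(years, totals)
    ax.set_xlabel("Year")
    ax.set_ylabel("Total emissions (units as given in the file)")
    fig.savefig(out_path, dpi=150)
    plt.close(fig)
    return Path(out_path)
```

In a notebook, the steps would be chained as `rows = read_dataset("data.csv")`, then `years, totals = total_emissions(rows)`, then `plot_and_save(years, totals, "emissions.png")`, once the column names are adjusted to match the real file. Keeping each step in its own small function makes it easier to see, and later to track with git, exactly which part of the analysis changed.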
Remember, being reproducible is not all or nothing. Let’s see which criteria we are currently doing well with …
… and what we could do to improve.
The code I just demonstrated is part of a git repository that is hosted on GitHub.
This means that any changes to the code are tracked within the git repository and the code is also quasi-archived.
We use git and GitHub as a version control system (Figure 5):
This also means that different users can make independent changes to the same document (Figure 6):
If there are conflicts between the users’ changes, these conflicts can be resolved by choosing which changes you want to keep (Figure 7):
Let’s say we want to make a change in our analysis. This change should be transparent and reproducible.
We can use git and GitHub for this.
Step 1: Make and save changes
Step 2: Commit change
Step 3: Push our changes to GitHub
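To practice the three steps without touching a real project, the sketch below runs them from Python in a throwaway folder. It assumes git is installed and available on the command line. The folder names, the file name `notes.txt`, the demo user identity, and the local bare repository standing in for GitHub are all hypothetical choices made for illustration.

```python
import subprocess
from pathlib import Path


def git(*args, cwd):
    """Run one git command inside `cwd` and return its printed output."""
    result = subprocess.run(
        ["git", *args], cwd=cwd, check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


def commit_and_push_demo(base_dir):
    """Walk through save, commit, and push inside a throwaway folder."""
    base = Path(base_dir)
    remote = base / "remote.git"  # hypothetical stand-in for the GitHub repo
    work = base / "analysis"      # hypothetical local working copy
    remote.mkdir()
    work.mkdir()
    git("init", "--bare", cwd=remote)
    git("init", cwd=work)
    git("remote", "add", "origin", str(remote), cwd=work)

    # Step 1: make and save a change.
    (work / "notes.txt").write_text("Changed the plot title.\n")

    # Step 2: stage the file and commit it with a descriptive message.
    git("add", "notes.txt", cwd=work)
    git(
        "-c", "user.name=Demo User",
        "-c", "user.email=demo@example.com",
        "commit", "-m", "Changed file content",
        cwd=work,
    )

    # Step 3: push the current branch to the remote.
    git("push", "origin", "HEAD", cwd=work)

    # Return the commit messages that the remote now knows about.
    return git("log", "--all", "--format=%s", cwd=remote)
```

Calling `commit_and_push_demo` on an empty temporary directory returns the commit message as recorded in the stand-in remote, which shows that the change travelled from the working copy to the remote. With a real GitHub repository, only the remote address differs; the `add`, `commit`, and `push` commands stay the same.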
We can now examine the changes in the GitHub repository (or in GitHub Desktop).
Activity: Create a repository using GitHub Desktop and make a change
File → New Repository
Show in Explorer (Windows) or Show in Finder (Mac)
test.txt
Added a new test file
Changed file content
GitHub
GitHub.com

Let’s say we are doing open science and have found some freely available code on GitHub that we want to build on. What can you do to get started?
Using GitHub.com, you can make a copy of a GitHub repository (also known as a repo) owned by another user or organization (a task referred to as forking a repository).
To fork a repo:
Navigate to the repo that you wish to fork. Example:
https://github.com/ISAT-DrG/ISAT_420_S26_Shared
In the upper right corner, you will see a button labeled Fork (Figure 8).
Figure 8: The Fork button in the upper right-hand corner of your screen. You can then create a copy of this repo in your account.
Click on the Fork button and select your user account when it asks you where you want to fork the repo.
You can now use GitHub Desktop to create a local copy of the repository on your computer.
File → Clone Repository
Local path
Clone

Oftentimes, there is a central repository with code that many people contribute to. This includes many of the Python packages that are part of the Pangeo ecosystem. To manage this collaboration, individual contributors work on their own forked copies of the project. When they are ready to share (or suggest) a change to the overall project, they can use a Pull Request (Figure 9).
A Pull Request is a workflow that allows
For now, we don’t need to know this workflow in detail. However, it is described in Chapter 8, Lesson 2 of the CU-Boulder Intro to Earth Data Science Textbook (2025).
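Returning to the cloning step from the forking section: for readers who prefer code over GitHub Desktop, cloning corresponds to the `git clone` command. The sketch below wraps that command in a small Python helper. It assumes git is installed; the function name `clone_repository` is hypothetical.

```python
import subprocess
from pathlib import Path


def clone_repository(source, destination):
    """Copy `source` (a URL or a local path) into `destination` via git clone."""
    subprocess.run(
        ["git", "clone", str(source), str(destination)],
        check=True,
        capture_output=True,
        text=True,
    )
    return Path(destination)
```

For a fork of the shared course repository, `source` would be the address of your own copy on GitHub.com (shown on the fork's page under the Code button), and `destination` the local folder where the files should go.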
Skills/Tool Check
Can you
commit the change, and push the change to GitHub?
fork and clone a shared repository so that you have the contents on your computer?
w3_basic_python_test.ipynb

The content and activities on this page are in large part based on the CU-Boulder EarthLab Intro to Earth Data Science Textbook (2025). Relevant chapters are: