rse-classwork's Issues

Reading and writing structured data files

The purpose of this exercise is to check that you are able to read and write files in commonly-used formats.

Make sure you have read the course notes on structured data files first!

For this exercise, you will work with the code you wrote to represent a group of friends during the last lesson (#7).
If you prefer, you can use the sample solution instead of yours, especially if you have used a custom class in your solution.

You can use either JSON or YAML for this exercise - the choice is up to you.

Write some code to save your group to a file in the format of your choice (JSON or YAML).
Make sure you can read the file by loading its contents. Is the result identical to the original structure?
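
To check yourself, here is a minimal sketch using JSON; the group variable and its structure here are placeholders standing in for your own code:

import json

# A stand-in for your own group structure.
group = {"Alice": {"age": 30, "relations": {"Bob": "friend"}}}

# Save the group to a file...
with open("group.json", "w") as f:
    json.dump(group, f, indent=2)

# ...and load it back.
with open("group.json") as f:
    loaded = json.load(f)

# JSON round-trips dictionaries of strings and numbers faithfully.
print(loaded == group)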

If you have questions or run into problems, leave a message on the Q&A forum on Moodle.

Writing a test for `times.py`

This exercise will look at understanding what the code in a given Python file (times.py) does, and writing a test to check that it works as expected.

  1. Fork the times-tests repository and clone it on your computer.
  2. Read the description of the exercise in the README file.
  3. Start by reading the code in times.py, understanding what it does, and running it (before making any modifications to it).
  4. The next step consists of converting the __main__ part of the code into a unit test (a sketch of what such a test might look like appears after this list).
  5. Check that your test passes by running pytest.
  6. When you are happy with your solution (or want some feedback!):
    1. Push your new code to your own fork.
    2. On GitHub, open a pull request from your fork to the original repository.
    3. In the description, include the text Answers UCL-MPHY0021-21-22/RSE-Classwork#16. This will link your PR to this issue.
    4. In the PR description, comment on anything you found difficult or interesting, or something you learned.
  7. Choose one of the other pull requests listed on this issue, and leave a review. Comment on things you find interesting or don't understand, any problems you think you spot, good solutions, or potential improvements.
  8. Mark the assignment on Moodle as complete.
  9. Think about what other aspects of times.py should be tested and report them on the Moodle questionnaire.
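
As a rough illustration of step 4, a converted test might look something like this; the names time_range and overlap_time follow the later exercises in this series, but their exact signatures are assumptions:

import times

def test_contained_range_overlap():
    # Replace the arguments and expected value with those from the __main__ block.
    large = times.time_range("2010-01-12 10:00:00", "2010-01-12 12:00:00")
    short = times.time_range("2010-01-12 10:30:00", "2010-01-12 10:45:00")
    # The short range lies entirely within the large one, so the overlap
    # should be the short range itself.
    assert times.overlap_time(large, short) == [
        ("2010-01-12 10:30:00", "2010-01-12 10:45:00")
    ]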

If you have questions or get stuck, ask on Moodle or book an office hours slot!

Sample solution

Automating `git bisect` - part I

In your homework (#25), you saw that even for just 24 commits (and there can be many more), you need to type quite a few repetitive git bisect commands to find the commit you're looking for.
It's therefore something that is useful to automate. The Solving automatically section in the notes may be useful. We will work with the same situation as in this week's homework.

Let's go through some steps in the next issues:

Step 0

🤩 If you've attempted the homework, your repository may be in a bisecting state.

Therefore, run the following so that everyone starts from the same point:

git bisect reset # to make sure you are not in a bisecting state
git switch main  # to go back to Charlene's original point
git reset --hard HEAD # to remove any modifications done to the tracked files

🙈 If you've not tried the homework yet, then clone Charlene's repository locally:

git clone git@github.com:UCL-MPHY0021-21-22/sagittal_average.git

Step 1

In a new file (test_sagittal_brain.py), create an input array and an expected output array (both numpy arrays) to test Charlene's code.

Take a look at the diagram in the previous issue (#25) to understand what Charlene is trying to do.

Think about why the current input (brain_sample.csv) and output (brain_average.csv) files that Charlene's been using to test her code are not very useful, and create new ones that could highlight common errors in this type of data manipulation.

Hint

What array can you create that will produce a different average value for each row?
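
One possible answer, as a sketch (the later steps in this series use a simpler variant with a single row of ones):

import numpy as np

# Fill row i with the value i: every row then has a different average
# (0, 1, ..., 19), while averaging over the wrong axis gives 9.5 everywhere.
data_input = np.arange(20).reshape(20, 1) * np.ones((20, 20))
expected = np.arange(20)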

React to this issue with a 👀 when your team has completed the task.

Finding bugs in history

Charlene Bultoc has just started a post-doc at an important neuroscience institute. She is doing research on a new methodology to analyse signals in our brains detected through a combination of CT and MRI. Using image processing techniques she can simplify the whole dataset into 20x20 arrays.

Diagram of the analysis

Her theory is that the average of such signals through the sagittal plane is constant over time, so she has written some software to calculate this. She decided to write that software in Python so she could share it (via GitHub, sagittal_average) with people from other labs. She didn't know as much Python when she started as she does now, so you can see that evolution in her program.

Charlene is an advocate of reproducibility, and as such she has been keeping track of which versions she's run for each of her results. "That's better than keeping just the date!" you can hear her saying. So for each batch of images she processes, she creates a file versions.txt with content like:

scikit-image == 0.16.2
scikit-brain == 1.0
git://git.example.com:brain_analysis.git@dfc801d7db41bc8e4dea104786497f3eb09ae9e0
git://github.com:UCL-MPHY0021-21-22/sagittal_average.git@d8bc3ebaecd0cc7a2872da4c81d30b56f9b746ad
numpy == 1.17

With that information she can go and run the same analysis again and again and be as reproducible as she can.

However, she's found that sagittal_average has a problem... and she needs to re-analyse all the data since the bug was introduced. Running the analysis for all the data she's produced is not viable: each run takes three days to execute (assuming the university cluster has resources available), and she has more than 300 results.

In all the versions of the program, it reads and writes csv files. Charlene has improved the program considerably over time, but kept the same defaults (specifically, an input file, brain_sample.csv, and an output file, brain_average.csv). She has always "tested" her program with the brain_sample.csv input file provided in the repository. However (and that's part of the problem!), the effect of the bug is not noticeable with that file.

We can help her either by letting her use our laptops or (better) by finding when the bug was introduced and then re-running only the analyses affected by it.

Finding when the bug was introduced seems the quickest way. Download the repository with her sagittal_brain.py script and use git bisect to find the commit at which the script started giving wrong results.

Do it manually first (as explained in this section of the notes).

Steps to help Charlene:

  1. Fork Charlene's repository and clone your fork.
  2. Run the latest version of the code with the existing input file.
  3. Create a new input file to figure out what the bug is.
    Hint: You can generate an input file that does show the error using the code snippet below:
    import numpy as np
    
    # Input where only the last row contains ones: averaging over the wrong
    # axis then gives a visibly wrong result.
    data_input = np.zeros((20, 20))
    data_input[-1, :] = 1
    np.savetxt("brain_sample.csv", data_input, fmt='%d', delimiter=',')
    You may need to create the brain_sample.csv file each time you move through the commits.
  4. Use git bisect manually until you find the introduction of the error. Take note of the hash and date of the commit that introduced the bug - you will need this information in class.
  5. How would you fix the bug?

Creating a ๐Ÿ๐Ÿ“ฆ with โ„น, ๐Ÿ‘ท and ๐Ÿ“š

Improve Charlene's package even further by adding basic information, a documentation website, and the configuration to run the tests automatically on GitHub Actions.

  1. Choose who in your team is sharing now! (make sure you've got a fork and a local copy of Charlene's repository)

  2. Write three files that will make this library sharable, citable and descriptive.

  3. Create a .github/workflows/pytest.yml file to run the tests automatically each time something is pushed to the repository (see also the solutions to exercise #19).

  4. Optional: As we did last week, generate a documentation website using sphinx. (Using the githubpages sphinx extension and pushing the build directory onto a gh-pages branch will publish the documentation on the repository's website.)

  5. Share your solution, even if it's a work in progress, as a pull request to Charlene's repository mentioning this issue (by including the text Addresses UCL-MPHY0021-21-22/RSE-Classwork#39 in the pull request description). Remember to mention your team members too (with @github_username)!

Refactoring - Part 2

This follows on from #43.

Stage 2: Using a Person Class

We will now look at how to represent and manipulate the person data with our own Person class.

Instead of each person being a dictionary, we will represent them with a class that has methods for dealing with the connections. We will restructure our code so that the functions become methods of the class. You may also wish to refer to the course notes on object-oriented design.

One example of the starting point for the structure is the file initial_person_class.py.
We have implemented some methods for these classes, but not everything that is required (the remaining methods have pass instead of actual code).

Your task:

  1. You should already have the files from the previous part.
  2. Fill in the remaining code in initial_person_class.py so that the file works as before.
  3. Run the file to make sure the assertions are still satisfied.
  4. Commit your changes.
  5. Create a pull request from your branch to the original friend-group repository and include the text Answers UCL-MPHY0021-21-22/RSE-Classwork#44 in the description to link your PR to this issue.
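
For orientation, the skeleton might look roughly like this (the real structure is defined in initial_person_class.py, so treat these names as assumptions):

class Person:
    def __init__(self, name, age, job):
        self.name = name
        self.age = age
        self.job = job
        self.connections = {}  # other Person objects -> relation type

    def add_connection(self, person, relation):
        self.connections[person] = relation

    def forget(self, person):
        # Remove the connection to the given person -- one of the
        # methods you may need to fill in.
        del self.connections[person]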

Automating `git bisect` - part VI

Continuation from #30.

Now that you have a script that tests Charlene's code, we can start to find when the error was introduced!

Step 6

Let's run our new script on the current state of Charlene's project and see what happens:

  • Make sure you are on Charlene's last commit (i.e., d8bc3eb); you can check with git log.

  • Run git bisect start to start the bisect process (it should not produce any output)

  • Run your script: python test_sagittal_brain.py

  • In which state is the code? Did it fail (bad) or did everything seem fine (good)?
    Run git bisect <state> <id-commit>

    Note

    If you are on this commit, you can use HEAD as the id.

Now let's go to a point in history where we believe the code was working correctly.

  • Take a look at the history of Charlene's repository: git log --oneline
  • Jump to her second commit: git checkout <id-commit> (the one where she introduced data for future testing)
  • Run your script: python test_sagittal_brain.py
  • In which state is the code now, bad or good?
    Run git bisect <state> <id-commit>

By now we should have told bisect that there is a good and a bad commit, and that we want to find when things started going wrong.
You can see which commits they are by running git bisect log.

Hint

The output of the log should be something like:

git bisect start
# bad: [d8bc3ebaecd0cc7a2872da4c81d30b56f9b746ad] Makes the file Pep8 compliant and fixes some typos on docs
git bisect bad d8bc3ebaecd0cc7a2872da4c81d30b56f9b746ad
# good: [9dc8a27ada280e4479241c37bcb4d7f50c34ca09] Adds input and output data for future testing
git bisect good 9dc8a27ada280e4479241c37bcb4d7f50c34ca09

Now let's run git bisect to find the commit that introduced the bug!

git bisect run python test_sagittal_brain.py

React to this issue with a 🚀 when your team has completed the task.

Stretch Goal: Friend group data functions

Now that you've got a structure for your group in #7, we'd like you to create a new branch off your current group branch and create some functions.

  1. Turn your video cameras on!
  2. Choose one person who will share their screen.
  3. Create a new branch from your current group branch that starts with stretch_. e.g. stretch_dpshelio-ageorgou-stefpiatek
  4. Discuss with your group the best way to write the following functions; you can add extra parameters to them if you think it would be useful. (A sketch of one possibility appears after this list.)
    1. forget(person1, person2) which removes the connection between two people in the group
    2. add_person(name, age, job, relations) which adds a new person with the given characteristics to the group
    3. average_age() which calculates the mean age for the group
  5. Commit your changes to your branch! (with a meaningful message)
  6. Push your changes from your computer to your fork.
  7. Create a pull request (PR) from your branch to the original friend-group repository.
    Add a meaningful title to that PR and don't forget to mention your partners in the description (as @username) and a link to this issue: Answers UCL-MPHY0021-21-22/RSE-Classwork#8
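
As a sketch of item 4, and assuming the dictionary structure from #7 (a global group whose values hold an "age" entry, among others), average_age could look like:

def average_age():
    # `group` is assumed to be the module-level dictionary from #7.
    all_ages = [person["age"] for person in group.values()]
    return sum(all_ages) / len(group)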

Creating a ๐Ÿ๐Ÿ“ฆ with an entry point

Help Charlene run a command in her package from anywhere. You can do so by adding an entry point following the instructions below.

  1. Choose who in your team is sharing now! (make sure you've pulled from your colleague's fork!)

  2. Move the if __name__ == "__main__": block to its own file (e.g., command.py) and add it as an entry point called "sagittal_average_run" in setup.py (see the sketch after this list).

  3. Add the dependencies of this library as requirements to setup.py.

  4. Try to install it by running pip install -e . from the directory where setup.py is.

  5. Go to a different directory, run python -c "import sagittal_average" and see whether the installation worked.

  6. Check you can use the entry point from anywhere, by calling sagittal_average_run <path/to/input/csv> from the different directory.

  7. Share your solution as a pull request to Charlene's repository mentioning this issue (by including the text Addresses UCL-MPHY0021-21-22/RSE-Classwork#38 in the pull request description). Remember to mention your team members too (with @github_username)!
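
For step 2, the relevant part of setup.py might look roughly like this (the package layout and the command.py:main target are assumptions based on the exercise text):

from setuptools import setup, find_packages

setup(
    name="sagittal_average",
    version="0.1.0",
    packages=find_packages(),
    install_requires=["numpy"],  # step 3: the library's dependencies (assumed)
    entry_points={
        "console_scripts": [
            # "<command name> = <module path>:<function to call>"
            "sagittal_average_run = sagittal_average.command:main",
        ],
    },
)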

Plotting the earthquake dataset

Your goal is to analyse the same earthquake data as before (#13) and produce two plots, showing:

  • the frequency (number) of earthquakes per year
  • the average magnitude of earthquakes per year

To help you get started, we have suggested an outline of the code. You can change this as you want, or use your own structure.

  1. Choose someone to share their screen and type. The other team members will tell them what to write.
    • Make sure that person has forked the earthquakes repository and cloned their fork locally. They should also give access to the rest of the members.
    • Make a new branch with your combined GitHub usernames, named plots-@username1-@username2-...
  2. If you are not sure you have read the data correctly, you may want to look at the sample solution. You can start from that or from one of your own answers.
  3. Take a few minutes to look at the outline given, and think about how you will structure your code. What steps do you need and how will they connect? Do you want to change the provided functions or add some more?
  4. Write some code to produce one plot.
  5. When you are finished (or have done as much as you can), push your code, and open a Pull Request to the original earthquakes repository. Include the text Answers UCL-MPHY0021-21-22/RSE-Classwork#15 in the description to link it to this issue. Add one or both plots if you want!
  6. If you have time, continue with the other plot and add it to the PR!

Some hints:

  • You can do the computations required in "plain" Python, but think about using the numpy library (the unique function or others could be helpful, depending on how you have approached the problem)
  • For plotting:
    • Make sure you have computed the values you need to plot!
    • Choose an appropriate plot type (if you need inspiration, there are various galleries) and then see how to create that in matplotlib.
    • See whether you need to put your data in a particular form to create the plot.
    • After plotting, do you need to make any visual adjustments? (on, for example, the axes, labels, colours...)
    • Save your plots to a file and check the result.
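
As an illustration of the hints above (with made-up data standing in for the dataset):

import numpy as np
import matplotlib.pyplot as plt

# Dummy data standing in for the year of each earthquake.
years = np.array([2001, 2001, 2003, 2003, 2003, 2005])

# np.unique with return_counts gives the number of earthquakes per year.
unique_years, counts = np.unique(years, return_counts=True)

plt.bar(unique_years, counts)
plt.xlabel("Year")
plt.ylabel("Number of earthquakes")
plt.savefig("frequency_per_year.png")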

Sample solution

More unit tests!

Now that you know how to create a test, create three further tests for times.py:

Setup

Either start from your homework solution, or use the sample solution.

Add collaborators

Add everyone in your group as collaborators to your fork.

Create three further tests

  • work collaboratively!
  • create a test each in test_times.py for:
    • two time ranges that do not overlap
    • two time ranges that both contain several intervals each
    • two time ranges where one ends exactly when the other starts
  • run pytest and see whether all tests are picked up and whether they pass.
  • fix any bugs in times.py the tests may have helped you find.
  • Add the new and modified files to the repository, commit them (with a meaningful message that also includes https://github.com/UCL-MPHY0021-21-22/RSE-Classwork/issues/17) and push them to your fork.

Sample Solution

Creating a fixture

Separating data from code

Your parametrised tests have probably now become a bit too big and difficult to read.
Create a fixture.yaml file where you can store what you parametrised before
in a more human-readable way.

Load the YAML file within the test module and use that structure to feed the parametrized test (a sketch appears after the example below).

The fixture.yaml could look like:

- generic:
    time_range_1: ...
    time_range_2: ...
    expected: 
        - ...
        - ...
- no_overlap:
    time_range_1: ...
    time_range_2: ...
    expected: []
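
A sketch of feeding the YAML file into the parametrized test (the overlap_time name is assumed from the other exercises in this series):

import pytest
import yaml

from times import overlap_time

with open("fixture.yaml") as f:
    fixtures = yaml.safe_load(f)

# Each list item is a one-entry mapping such as {"generic": {...}}: keep the
# names for readable test IDs and unpack the inner mappings for the cases.
names = [list(item)[0] for item in fixtures]
cases = [list(item.values())[0] for item in fixtures]
params = [(c["time_range_1"], c["time_range_2"], c["expected"]) for c in cases]

@pytest.mark.parametrize("time_range_1,time_range_2,expected", params, ids=names)
def test_overlap(time_range_1, time_range_2, expected):
    assert overlap_time(time_range_1, time_range_2) == expected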

Once you have a solution, commit it including Answers https://github.com/UCL-MPHY0021-21-22/RSE-Classwork/issues/22 in the message and push it to your fork on GitHub.

Sample solution

Summary data from group

This exercise is done alone:

  1. Using the fork of the data structure for the group of people that you created earlier for #7, clone your fork locally if you haven't already (or create a new fork and add the code from the example solution if your group didn't manage to create one).

  2. Create a new branch from your team-named branch with your name only

  3. Add some code that makes use of comprehension expressions to your group.py file so that it prints out the following when the script is run:

    • the maximum age of people in the group
    • the average (mean) number of relations among members of the group
    • the maximum age of people in the group that have at least one relation
    • [more advanced] the maximum age of people in the group that have at least one friend
  4. Create a pull request (PR) from your branch to the original repository.
    Add a meaningful title to that PR and a link to this issue: Answers UCL-MPHY0021-21-22/RSE-Classwork#12
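
As a sketch of the kind of comprehensions asked for (assuming, as in the earlier exercises, a group dictionary whose values hold "age" and "relations" entries):

max_age = max(person["age"] for person in group.values())
mean_relations = sum(len(p["relations"]) for p in group.values()) / len(group)
max_age_related = max(p["age"] for p in group.values() if p["relations"])

print(max_age, mean_relations, max_age_related)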

Sample solution (with previously given sample data structure)

Generating and Solving conflicts

  1. Read the content of the script below
  2. Run the commands one by one, or as a script, on your machine. It will create a merge conflict.
  3. Resolve the merge conflict so the text in README.md is "Hello World".
  4. Make sure your working tree is clean
  5. Create an issue in your fork with a code block showing how the file looks before and after the conflict. Add a link to this issue in the description of your issue as:
Answers UCL-MPHY0021-21-22/RSE-Classwork#3

The script:
cd Desktop/
mkdir MergeConflict
cd MergeConflict/
git init
touch README.md
echo "Hello" > README.md
git add README.md
git commit -m "first commit on main"
# if your default branch is not main, rename master with: git branch -m main
git checkout -b new-branch
echo "Hello World" > README.md
git commit -am "first commit on new-branch"
git checkout main
echo "Hola" > README.md
git commit -am "second commit on main: adds something in Spanish"
git merge new-branch
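
After the final git merge new-branch, README.md will contain conflict markers along these lines (the exact labels depend on your git version):

<<<<<<< HEAD
Hola
=======
Hello World
>>>>>>> new-branch

Editing the file down to the single line Hello World, then running git add README.md and git commit, resolves the conflict.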

Automating `git bisect` - part V

Continuation from #29.

Now you have created two arrays, can read and save them, can compare them with expected values, and can call an external command from within Python. Let's put it all together.

Step 5

Bring what's needed from test_call_command.py to test_sagittal_brain.py so that when calling this script the following happens:

  1. a good input 20x20 array is created (input);
  2. an expected array with 20 elements is created (expected);
  3. the input array is saved as a csv file (brain_sample.csv);
  4. Charlene's code is executed from within this script (using subprocess);
  5. the output produced by Charlene's code (brain_average.csv) is read into output; and
  6. test that output and expected are equal.

Run that script from the bash terminal. (A sketch of the assembled script is shown below.)
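
A sketch of the assembled script (file names follow the steps above; the name of Charlene's script is taken from the other exercises in this series):

import subprocess
import numpy as np

# 1-2. A good input array and the expected per-row averages.
data_input = np.zeros((20, 20))
data_input[-1, :] = 1
expected = np.zeros(20)
expected[-1] = 1

# 3. Save the input where Charlene's code expects it.
np.savetxt("brain_sample.csv", data_input, fmt='%d', delimiter=',')

# 4. Run Charlene's code.
subprocess.run(["python", "sagittal_brain.py"])

# 5-6. Read the output back and compare: a mismatch raises an error, which
# also makes the script exit with a non-zero status (handy for git bisect run).
output = np.loadtxt("brain_average.csv", delimiter=',')
np.testing.assert_array_equal(output, expected)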

React to this issue with a 🎉 when your team has completed this task.

Approximating π using Numba/Cython

This exercise builds on #46. It is part of a series that looks at execution time of different ways to calculate π using the same Monte Carlo approach.

This exercise uses Numba and Cython to accomplish this approximation of π. A Numba version of the code is already written, and you can find it in the calc_pi_numba.py file in the pi_calculation repository, on the class branch. Your job is to measure how much time it takes to complete in comparison to #46.

Preparation

The two frameworks we will look at allow you to write Python-looking code and compile it into more efficient code which should run faster. Numba is a compiler for Python array and numerical functions. Cython is a way to program C extensions for Python using a syntax similar to Python.

Both frameworks should come with your conda installation. If not, and you get errors when running the instructions below, use conda or pip to install them (see their websites linked above for instructions).

Using Numba

  1. Look at the implementation using numba: calc_pi_numba.py
  2. Discuss how different it looks from the original. Is it more or less readable? Can you understand what the differences mean?
  3. Run the code with python calc_pi_numba.py. How does the time compare to the original?
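
If you want to experiment beyond the provided file, a minimal illustration of the Numba approach looks like this (this is not the course's calc_pi_numba.py, just a sketch of the decorator):

import random
from numba import njit

@njit
def calculate_pi(n):
    # Monte Carlo: the fraction of random points in the unit square that
    # fall inside the quarter circle approximates pi/4.
    inside = 0
    for _ in range(n):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / n

print(calculate_pi(1_000_000))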

Using Cython

Next, try to use Cython to approximate π. This part will be easier for users of Linux and OS X, as getting Cython to run on Windows is a little more involved.

We will use a notebook for this example, as it lets us see more information about how Cython works.

  1. Open the Jupyter notebook in calc_pi_cython.ipynb.
  2. As before, discuss how different the code looks from the original.
  3. Use %timeit within the notebook to compare with the runtime of the Numba version and the original code.
  4. From what you have read or know, can the Cython performance be further improved?

Argument Parsing 1/3

When writing code, it is important to think about how you or others can run it. A popular way is to use a command-line interface, so that your code can be executed as a script from a terminal. In this exercise we will look at the tools that Python offers for creating such interfaces.

We will use the squares.py file we used last week for the documentation exercise. We will make the code more generic by creating a command-line interface that will make it easier to call.

Constant weight

Let's first build an interface without weights (assuming they are constant and equal to 1).

  1. Choose who in your team is sharing
  2. Make sure you have a fork and a local copy of the average_squares repository.
  3. Open the file squares.py. Make sure you can run it from a terminal! (python squares.py)
  4. Look at the part of the file that is inside the if __name__ == "__main__": guard. This is the code that you will work on. Currently, the input values are hardcoded into the file.
  5. Use the argparse library to rewrite this part so that it reads only the numbers from the command line (keep the weights hardcoded for now). The file should be runnable as python squares.py <numbers>... (where <numbers> should be replaced by the sequence of numbers of your choice). A sketch of one possible result appears after this list.
    • Look at the example in the notes to get you started.
    • Decide which values should be read from the command line.
    • Add them as argparser arguments.
    • Check the auto-generated help: python squares.py --help.
    • Check that you can run the file with the new form.
  6. Share your solution as a pull request mentioning this issue (by including the text Addresses UCL-MPHY0021-21-22/RSE-Classwork#32 in the pull request description). Remember to mention your team members too (with @github_username)!
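
A sketch of what the rewritten block in step 5 might look like (the name and signature of average_of_squares are assumptions based on the documentation exercise):

import argparse

def average_of_squares(numbers, weights):
    # Stand-in for the function already defined in squares.py.
    return sum(w * n ** 2 for n, w in zip(numbers, weights)) / len(numbers)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Compute the average of squares.")
    parser.add_argument("numbers", nargs="+", type=float,
                        help="the numbers to average")
    args = parser.parse_args()
    weights = [1] * len(args.numbers)  # the weights stay hardcoded for now
    print(average_of_squares(args.numbers, weights))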

Learning all about Pull Requests

In small groups:

  1. Fork our Travel guide repository
  2. Clone your fork locally
  3. Create a new branch named with a combination of your team's usernames,
    e.g., dpshelio-ageorgou
  4. Create a new file in the right place named after a place both of you would like to visit. Create any intermediate directories needed.
    e.g., ./europe/spain/canary_islands.md
  5. Add to that file:
    • a title (e.g., # Canary Islands)
    • a small paragraph why you would like to go there
    • End the file with a link to wikivoyage and/or wikipedia of that place.
      e.g., More info at [wikivoyage](https://en.wikivoyage.org/wiki/Canary_Islands) and [wikipedia](https://en.wikipedia.org/wiki/Canary_islands)
  6. Commit that to your branch! (with a meaningful message)
  7. Add the internal links needed to get from the main page to that one
    e.g., link from Europe's README.md to Spain's README.md, and from Spain's README.md to the Canary Islands file canary_islands.md
  8. Commit these changes! (with a meaningful message)
  9. Create a pull request from your branch to my repository.
    Add a meaningful title to that PR and don't forget to mention your partner in the description (as @username) and a link to this issue
Answers UCL-MPHY0021-21-22/RSE-Classwork#6

Refactoring - Part 3

This follows on from #44.

Stage 3: Object-oriented structure

We will now look at how to represent and manipulate this data using our own user-defined objects.

Instead of a dictionary, we will define two classes, which will represent a single person and the whole group. We will restructure our code so that group functions apply directly to the group, instead of the person having all of the methods.
Again, you may also wish to refer to the course notes on object-oriented design.

Take a look at the file initial_two_classes.py to see one possible way in which the code could be structured.

Internally, the Group class still uses a dictionary to track connections, but someone using the class does not need to be aware of that. We have implemented some methods for these classes, but not everything that is required (the remaining methods have pass instead of actual code).

Your task:

  1. You should have the files from the previous parts of the exercise.
  2. Fill in the remaining method definitions.
  3. Update the section at the end of the file so that it creates the same group as in the previous example, but using the new classes you have defined.
  4. Run the file to make sure it gives the same results as before (that is, the assertions still pass).
  5. Commit your changes.
  6. Create a pull request from your branch to the original friend-group repository and include the text Answers UCL-MPHY0021-21-22/RSE-Classwork#45 in the description to link your PR to this issue.
  7. Think of the benefits and drawbacks of the object-oriented structure compared to the original approach (collection of functions).
  8. If you have time, think of other changes you consider useful and try them.
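
For orientation, the two-class skeleton might look roughly like this (the real structure is in initial_two_classes.py; names here are assumptions):

class Person:
    def __init__(self, name, age, job):
        self.name = name
        self.age = age
        self.job = job

class Group:
    def __init__(self):
        self.members = {}      # name -> Person
        self.connections = {}  # internal dictionary of relations

    def add_person(self, person):
        self.members[person.name] = person

    def average_age(self):
        all_ages = [p.age for p in self.members.values()]
        return sum(all_ages) / len(all_ages)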

Parametrising the tests

Avoid code repetition using pytest.mark.parametrize

Now that you have written four different positive tests for times.py, take a step back and look at your code: there is a lot of repetition. Almost every test (apart from the negative test) essentially does the same thing (albeit with different data), which makes our test code harder to change in the future.

We can use pytest.mark.parametrize to get our tests DRY (Don't Repeat Yourself).

  • You have seen pytest.mark.parametrize in the notes. Using the documentation of pytest.mark.parametrize if needed, see how you can compress most of the tests into a single one.

You will need the test function to accept parameters (for example time_range_1, time_range_2 and expected),
let the parametrize decorator know about them via its first argument, and pass a list of 3-tuples
with the values for each test, as sketched below.
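
A sketch of the compressed test (the values shown are hypothetical placeholders; swap in the data from your existing tests, and note that overlap_time's name is assumed from the other exercises):

import pytest
from times import overlap_time

@pytest.mark.parametrize(
    "time_range_1,time_range_2,expected",
    [
        ([("10:00", "12:00")], [("10:30", "10:45")], [("10:30", "10:45")]),
        ([("10:00", "11:00")], [("12:00", "13:00")], []),  # no overlap
    ],
)
def test_overlap(time_range_1, time_range_2, expected):
    assert overlap_time(time_range_1, time_range_2) == expected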

Commit your solution including Answers https://github.com/UCL-MPHY0021-21-22/RSE-Classwork/issues/21 in the message, and push it to GitHub.

What are the advantages and disadvantages of using parametrize in this case?

Sample solution

Creating a ๐Ÿ๐Ÿ“ฆ with tests

Help Charlene to test her package (remember to commit after each step, if appropriate).

  1. Choose who in your team is sharing now! (make sure you've pulled the latest changes from your team's fork.)

  2. Create a tests directory inside sagittal_average and add a test similar to what we used last week when we discovered the bug.

    Hint

    You need a test_something function that runs all the steps below:

    1. Create an input dataset
      data_input = np.zeros((20, 20))
      data_input[-1, :] = 1
    2. Save it into a file
      np.savetxt("brain_sample.csv", data_input, fmt='%d', delimiter=',')
    3. Create an array with expected result
      # The expected result is all zeros, except the last entry, which should be 1
      expected = np.zeros(20)
      expected[-1] = 1
    4. Call the function with the files
      run_averages(file_input="brain_sample.csv",
                   file_output="brain_average.csv")
    5. Load the result
      result = np.loadtxt(TEST_DIR / "brain_average.csv",  delimiter=',')
    6. Compare the result with the expected values
      np.testing.assert_array_equal(result, expected)

    What could you do to make sure that these files we are creating don't interfere with our repository or the rest of the package?

  3. Add an __init__.py file to the tests folder.

  4. Fix sagittal_brain.py (as you may remember from last week, the code wrongly averages over the columns, not the rows), make sure the test passes and commit these changes.

  5. Try to install it by running pip install -e . from the directory where setup.py is, and then run the tests with pytest.

  6. Share your solution as a pull request to Charlene's repository mentioning this issue (by including the text Addresses UCL-MPHY0021-21-22/RSE-Classwork#37 in the pull request description). Remember to mention your team members too (with @github_username)!

Approximating π using parallelisation

This exercise builds on #46. It is part of a series that looks at execution time of different ways to calculate π using the same Monte Carlo approach.

This exercise uses the Message Passing Interface (MPI) to accomplish this approximation of π. The code is already written, and you can find it in calc_pi_mpi.py in the pi_calculation repository, on the class branch. Your job is to install MPI, and measure how much time it takes to complete in comparison to #46.

MPI

MPI allows parallelisation of computation. An MPI program consists of multiple processes, existing within a group called a communicator. The default communicator contains all available processes and is called MPI_COMM_WORLD.

Each process has its own rank and can execute different code. A typical way of using MPI is to divide the computation into smaller chunks, have each process deal with a chunk, and
have one "main" process to coordinate this and gather all the results. The processes can communicate with each other in pre-determined ways as specified by the MPI protocol -- for example, sending and receiving data to a particular process, or broadcasting a message to all processes.
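
As a minimal illustration of these concepts (this is not calc_pi_mpi.py), each process below sums a different chunk and rank 0 gathers the results:

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Each rank sums a different slice of the work.
partial = sum(range(rank, 100, size))

# Rank 0 collects one partial result from every process.
totals = comm.gather(partial, root=0)
if rank == 0:
    print("total:", sum(totals))

Run it with, for example, mpiexec -n 4 python example.py.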

Preparation

We are going to run the original (non-numpy) version in parallel, and compare it to the non-parallel version.

We will be using mpi4py, a Python library that gives us access to MPI functionality.

Install mpi4py using conda:

conda install mpi4py -c conda-forge

or pip:

pip install mpi4py

On Windows you will also need to install MS MPI.

The MPI version of the code is available at calc_pi_mpi.py. Look at the file and try to identify what it is doing -- it's fine if you don't understand all the details! Can you see how the concepts in the brief description of MPI above are reflected in the code?

Execution

  1. Run the MPI version as:
    mpiexec -n 4 python calc_pi_mpi.py
    The -n argument controls how many processes you start.
  2. Increase the number of points and processes, and compare the time it takes against the normal version. Note that to pass arguments to the python file (like -np below), we have to give those after the file name.
    mpiexec -n 4 python calc_pi_mpi.py -np 10_000_000
    python calc_pi.py -np 10_000_000 -n 1 -r 1
    Tip: To avoid waiting for a long time, reduce the number of repetitions and iterations of timeit (1 and 1 in this example)
  3. Think of these questions:
    • Is the MPI-based implementation faster than the basic one?
    • Is it faster than the numpy-based implementation?
    • When (for what programs or what settings) might it be faster/slower?
    • How different is this version from the original? How easy is it to adapt to using MPI?

Friend group data model

  1. Turn your video cameras on!
  2. Choose one person who will share their screen.
  3. Fork the friend group repository to one of your accounts.
  4. Add everyone else in your group as a collaborator to the forked repository.
  5. Clone your fork locally.
  6. Create a new branch named with a combination of your team's usernames,
    e.g., dpshelio-ageorgou-stefpiatek.
  7. Write your code in the file group.py to do what the exercise asks - see the instructions in the README file of the exercise repository.
  8. Commit your changes to your branch! (with a meaningful message)
  9. Push your changes from your computer to your fork.
  10. Create a pull request (PR) from your branch to the original friend-group repository.
    Add a meaningful title to that PR and don't forget to mention your partners in the description (as @username) and a link to this issue: Answers UCL-MPHY0021-21-22/RSE-Classwork#7

If you finish all of that, you can work on our stretch goal #8.

Learning branches with git

In small groups, using GitHub's visualisation tool as shown in class, do the steps needed to replicate the repository structure shown below.

โ— hash numbers for the commit are going to be different and the final shape of the graph may look slightly different.

โœ”๏ธ When done, take a screenshot of your result and create an issue in your fork that includes the screenshot of the result, a code block with your steps and a link to this issue.

[Image: the repository structure to replicate]

To add a code block with your steps use the following syntax:

```bash
git commit -m "First commit"
git ...
```

that will render as:

git commit -m "First commit"
git ...

To refer back to this issue you need to add the following text to your issue message:

Answers UCL-MPHY0021-21-22/RSE-Classwork#2

That creates a link that will appear under this issue.

Adding a Continuous Integration system

Use CI to run the tests for you.

It's always possible that we forget to run the tests before pushing to GitHub. Luckily, continuous integration (CI) platforms can help us catch failing tests even when we do forget to run them.
In this exercise, you will use GitHub Actions to do exactly this.

Set up GitHub Actions to run our tests for every commit we push to our repository.

  • Get some initial familiarity with GitHub Actions by reading the quickstart guide
  • Work collaboratively!
  • Copy the code snippet below and paste it into .github/workflows/python-tests.yml in your repository.
name: Pytest

on: [push]

jobs:
  build:

    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: [3.6, 3.9]

    steps:
      - uses: actions/checkout@v2
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v2
        with:
          python-version: ${{ matrix.python-version }}
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install pytest
      - name: Test with pytest
        run: |
          pytest
  • Discuss with your group what you think each of the lines in .github/workflows/python-tests.yml does
  • Add .github/workflows/python-tests.yml to the repository, commit it and push it to github. Link to this issue on that commit by including Answers https://github.com/UCL-MPHY0021-21-22/RSE-Classwork/issues/19 in the commit message.
  • Check whether the tests pass on your remote (Hint: check the Actions tab on your GitHub fork)

If you're done with this issue, try to add test coverage by working on this related issue: #20

Sample solution and sample solution in action.

Working with the US Geological Survey earthquake data set

This exercise will look at how to read data from an online source (web service), and how to explore and process it.

  1. Fork the earthquakes repository and clone it on your computer.
  2. Read the description of the exercise in the README file.
  3. Start by getting and exploring the data (step 1), then complete the code to process it (step 2).
  4. When you are happy with your solution (or want some feedback!):
    1. Push your new code to your own fork.
    2. On GitHub, open a pull request from your fork to the original repository.
    3. In the description, include the text Answers UCL-MPHY0021-21-22/RSE-Classwork#13. This will list your PR to this issue.
    4. In the PR description, comment on what you found difficult or interesting, or something you learned. If you have finished the exercise, also mention the answers you found (e.g. "The maximum magnitude is 3 and it occurred at coordinates (4.0, -3.8)").
  5. Choose one of the other pull requests listed on this issue, and leave a review. Comment on things you find interesting or don't understand, any problems you think you spot, good solutions or potential improvements.
  6. Mark the assignment on Moodle as complete, and fill in the short feedback form.
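
For reference, querying a web service from Python typically uses the requests library; a sketch along these lines (the exact URL and parameters for the exercise are in its repository, so these are illustrative):

import requests

response = requests.get(
    "https://earthquake.usgs.gov/fdsnws/event/1/query",
    params={"format": "geojson", "starttime": "2000-01-01", "minmagnitude": 1},
)
quakes = response.json()
print(len(quakes["features"]), "earthquakes retrieved")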

If you have questions or get stuck, ask on Moodle or book an office hours slot!

Sample solution

๐Ÿ’„๐Ÿ code - linting

  1. Analyse Charlene's package and run one of the linting tools.
    You may need to install them. Which messages did you get? Was your IDE (e.g., VS Code) warning you about them already?

  2. Fix the issues either manually or automatically using a code formatter (e.g., yapf or black).

  3. Can you think of a way of ensuring the style is checked before merging new contributions? Add your suggestions below.

  4. Share your solution, even if it's a work in progress, as a pull request to Charlene's repository mentioning this issue (by including the text Addresses UCL-MPHY0021-21-22/RSE-Classwork#40 in the pull request description). Remember to mention your team members too (with @github_username)!

Argument Parsing 2/3

Carrying on from the previous exercise, we will now add an optional parameter to accept weights in your latest squares.py file.

With weights

  1. Choose who in your team is sharing now! (pull the code from the previous exercise into your local repository)
    Hint: You need to add a new remote from your team member and pull their branch

  2. Create a new branch from the branch used in the previous exercise.

  3. Open the file squares.py. Make sure you can run it from a terminal with some input values!

  4. Look at the part of the file that is using argparse

  5. Add a new argument that's optional and that can accept the weights, as done previously with the numbers. The file should be runnable as

    python squares.py <numbers>... --weights <weights>...

    (where <numbers> and <weights> should be replaced by the sequence of numbers and weights of your choice).

    • Look at the argparse documentation
    • Add the weights as argparser arguments.
    • Check the auto-generated help: python squares.py --help.
    • Check that you can run the file with the new form, whether you pass the weights or not. (A sketch of the new argument appears after this list.)
  6. Share your solution as a pull request mentioning this issue (by including the text Addresses UCL-MPHY0021-21-22/RSE-Classwork#33 in the pull request description). Remember to mention your team members too (with @github_username)!
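
One possible shape for the optional argument (a sketch, not the sample solution):

import argparse

parser = argparse.ArgumentParser(description="Weighted average of squares.")
parser.add_argument("numbers", nargs="+", type=float)
parser.add_argument("--weights", nargs="+", type=float, default=None,
                    help="optional weights, one per number")
args = parser.parse_args()

# Fall back to constant weights of 1 when none are given.
weights = args.weights if args.weights is not None else [1] * len(args.numbers)
print(args.numbers, weights)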

Using docstrings and doctests

This exercise will show why it is important to keep documentation accurate, and how to do this automatically using docstrings and doctests.

Setup

Understanding

  • Spend some time reading and understanding the code.
  • Do you understand what it's meant to do? Do the docstrings help?
  • Run the code with the default inputs. Does it produce the output you expect?
  • Try running the code with other inputs. What happens?

Exercises

As you may have discovered, the code in squares.py does contain some mistakes. Thankfully the functions in the file include documentation that explains how they should behave.

Run the doctests

  • Use the doctest module to see whether the documentation of the code is accurate: python -m doctest squares.py. (For reference, a doctest example is shown after this list.)
  • Try to understand the structure of the output - what errors are reported, are they what you expected from looking at the code in the previous steps?
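
For reference, a doctest is a usage example embedded in a docstring that the doctest module runs and checks; a generic illustration (not the actual contents of squares.py):

def square(x):
    """Return x squared.

    >>> square(3)
    9
    """
    return x * x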

Update the docstrings

  • Look at the errors related to the average_of_squares function.
    • Figure out where the mismatch between the documentation (intended behaviour) and the actual behaviour of the function exists.
    • Correct the usage examples in the average_of_squares docstring that are incorrect.

Correct the code and verify

  • Re-run the code, again comparing the actual and expected behaviour. What is the error?
  • Correct the error in the code and rerun doctest to confirm that the average_of_squares documentation is now correct.

Repeat the process for convert_numbers

  • Look at the doctest error from the convert_numbers documentation.
  • Can you identify the bug? How would you fix this?

Submit a Pull Request

Once you have completed or made progress on the exercises:

  • Create a pull request (PR) from your branch to the upstream repository. Add a meaningful title to that PR and a link to this issue: Answers UCL-MPHY0021-21-22/RSE-Classwork#23

Code coverage

Knowing the coverage

Make sure you've installed pytest-cov in your environment.

  • Run the tests with coverage and produce an HTML report.
  • View it by opening the HTML report in your browser.
  • Commit, push and link to this issue by including Answers https://github.com/UCL-MPHY0021-21-22/RSE-Classwork/issues/20 in the commit message.

Ensure GitHub Actions also reports your coverage!

sample solution

pytest --cov="times" --cov-report html 

and coverage in action on CI.

Automating `git bisect` - part III

Continuation from #27.

Now that you have created two arrays and are able to read and save them, let's compare them.

Step 3

Use numpy to test whether the output array you read in #27 is equal to the expected array you created in #26.

Hint

Numpy has a testing function to test/compare arrays: np.testing.assert_array_equal
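
For example (with placeholder arrays):

import numpy as np

output = np.zeros(20)
expected = np.zeros(20)

# Passes silently when the arrays are equal; raises an AssertionError otherwise.
np.testing.assert_array_equal(output, expected)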

React to this issue with a 😕 when your team has completed this task.

Refactoring - Part 1

For this exercise, we will look at how to rewrite (refactor) existing code in different ways, and what benefits each new structure offers.

We will work with some code that describes a group of acquaintances, as we saw in a previous exercise (issue #7).

Stage 1: Remove global variables

Look at the initial version of the file, which defines a specific group using a dictionary and offers some functions for modifying and processing it.

You may notice that the dictionary is a global variable: all the functions refer to it but do not take it as a parameter.
This situation can lead to difficulties (why?), so we will restructure the code to avoid it.

Rewrite the functions so that they take in the dictionary that they work on as an argument.
For example, the function that computes the average age should now look like:

def average_age(group):
    all_ages = [person["age"] for person in group.values()]
    return sum(all_ages) / len(group)

Your task:

  1. Fork the friend group repository if you haven't already.
  2. Checkout the week09 branch and go to the week09/refactoring directory.
  3. Change average_age as above, and the other functions of group.py in a similar way.
  4. Update the section at the end of the file (after if __name__ == "__main__") to create the sample dictionary
    there, and run the functions that alter it.
  5. Run your file to make sure the asserts still pass.
  6. Commit your changes!
  7. Create a pull request from your branch to the original friend-group repository and include the text Answers UCL-MPHY0021-21-22/RSE-Classwork#43 in the description to link your PR to this issue.
  8. Think of the benefits and drawbacks of this approach compared to the original version.
  9. If you have time, think of other changes you consider useful and try them.

Automating `git bisect` - part IV

Continuation from #28.

Now you have created two arrays, can read and save them, and can compare them with expected values. Next, let's call external commands from Python.

Step 4

Use subprocess to call a system command from within Python. The aim is to later call Charlene's programme this way.

In a new Python file (e.g., test_call_command.py), start with this content:

import subprocess

subprocess.run(["ls", "-aF"])

and run it!

Are you using Windows and getting errors?

If you're on Windows, ideally use Git Bash (If you're on the Windows Command Prompt cmd, you need to pass cmd-compatible commands to subprocess, e.g. dir instead of ls). On Windows, you might also need to pass shell=True as an additional argument.

subprocess.run(["ls", "-lh"], shell=True)

Run it, and iterate a couple of times, changing the command to others like:

  • ls -aF (which lists files),
  • echo 'hello world' (which prints hello world on the screen),
  • date +%Y%m%d (which prints today's date with the selected format)
  • wc -l sagittal_brain.py (which counts the number of lines of the sagittal_brain.py file)
  • wc -c AFileThatDoesNot.exist (which would count the number of characters, but should fail as the file doesn't exist)

Once you've understood how subprocess.run works, try to call Charlene's programme.

Hint

How do you call a Python script from the command line? ______ filename.py.

React to this issue with a โค when your team has completed this task.

Argument Parsing 3/3

Carrying on from the previous exercise, we will now change the options so that, instead of being read from the command line, the numbers are read from a text file (one number per line); the optional weights parameter will also accept a file.

Reading data from files

  1. Choose who in your team is sharing now! (pull the code from the previous exercise into your local repository)
    Hint: You need to add a new remote from your team member and pull their branch

  2. Create a new branch from the branch used in the previous exercise.

  3. Open the file squares.py. Make sure you can run it from a terminal with some input values!

  4. Look at the part of the file that is using argparse

  5. Modify the arguments so the data is read from a text file, with the weights still optional. The file should be runnable as:

    python squares.py <file_numbers> --weights <file_weights>

    where <...> should be replaced by the file of your choice.

    • Look at the working with data section to refresh how we read files in Python.
    • Modify the argparser arguments to receive file names instead of numbers.
    • Check the auto-generated help: python squares.py --help.
    • Check that you can run the file with the new form, whether you pass a weights file or not. (A sketch appears after this list.)
  6. Share your solution as a pull request mentioning this issue (by including the text Addresses UCL-MPHY0021-21-22/RSE-Classwork#34 in the pull request description). Remember to mention your team members too (with @github_username)!
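
One possible shape (a sketch; the read_numbers helper is our own assumption):

import argparse

def read_numbers(filename):
    # One number per line.
    with open(filename) as f:
        return [float(line) for line in f if line.strip()]

parser = argparse.ArgumentParser(description="Average of squares from files.")
parser.add_argument("file_numbers", help="text file with one number per line")
parser.add_argument("--weights", default=None, help="optional weights file")
args = parser.parse_args()

numbers = read_numbers(args.file_numbers)
weights = read_numbers(args.weights) if args.weights else [1] * len(numbers)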

Measuring performance and using numpy

This exercise is the first in a series. The series will look at the execution time of calculating π using the same algorithm, but different implementations and libraries.

This exercise initially uses pure Python to accomplish this approximation of π. The code is already written, and you can find it in the pi_calculation repository.
Your job is to understand the code, measure how much time it takes to complete, and then adapt it to use numpy instead of pure Python.

Step 1: Measuring how long code takes using timeit

The code uses the timeit module from the standard library. There are different ways you can use timeit: either as a module or from its own command-line interface. Check out timeit's documentation to see the different possibilities. Our calc_pi.py wraps the module implementation of timeit, and provides a similar interface to the command line interface of timeit.

Your task:

  1. Run the code with the default values using python calc_pi.py.
  2. Now run it by specifying values for some arguments (to see which arguments you can use, use --help or look at the source code).
  3. In case you would like to time a function (like calculate_pi in this case) without writing all that boilerplate, you can run
    python -m timeit -n 100 -r 5 -s "from calc_pi import calculate_pi_timeit" "calculate_pi_timeit(10_000)()"
    Try it!
  4. Try to understand the source code in more depth:
    • What does calculate_pi_timeit function do, roughly?
    • How does timeit.repeat work?
    • Why do we repeat the calculation multiple times?
    • Can you think of any changes that could make the code faster?

Step 2: Using numpy

The course notes describe how using the numpy library can lead to faster and more concise code.

Your task:

  1. Complete the file calc_pi_np.py so that it does the same as calc_pi.py, but uses numpy arrays instead of lists. Update the functions accordingly (you can change their arguments if it makes more sense for your new version).
    Hint: Instead of creating n x and y values independently, generate an array of size (n, 2). (A sketch of this approach appears after this list.)
  2. Which version of the code is faster, the one that uses numpy or the one that uses pure Python?
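
A sketch of the numpy approach suggested by the hint (not the sample solution):

import numpy as np

def calculate_pi(n):
    # n random (x, y) points in the unit square.
    points = np.random.uniform(size=(n, 2))
    # Count how many fall inside the quarter circle of radius 1.
    inside = np.sum(np.sum(points ** 2, axis=1) <= 1.0)
    return 4.0 * inside / n

print(calculate_pi(1_000_000))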

When you have completed the exercise, react to this issue using the available emojis, or post your comparison of times below!

Negative testing

Negative tests - Test that something that is expected to fail actually does fail

time_range may still work when end_time is before start_time, but that may make overlap_time not work as expected.

  • Work collaboratively!
  • Write a test that tries to generate a time range for a date going backward.
  • Modify time_range to produce an error (ValueError) with a meaningful message.
  • Use pytest.raises to check for that error (including the error message!) in the test, as sketched below.
  • Commit, push and link to this issue by including Answers https://github.com/UCL-MPHY0021-21-22/RSE-Classwork/issues/18 in the commit message.
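
A sketch of such a test (the error message is hypothetical; match whatever your time_range raises):

import pytest
from times import time_range  # name assumed from the other exercises

def test_backwards_time_range_rejected():
    with pytest.raises(ValueError, match="end .* before .* start"):
        time_range("2010-01-12 12:00:00", "2010-01-12 10:00:00")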

What other similar tests could we add?

Sample Solution

Generating documentation with Sphinx

This exercise will introduce you to the basics of Sphinx using the same code you looked at in the previous exercise (#23).

Setup

  • Navigate to the average_squares folder that you used in the previous exercise.

(Note: You will be able to complete this exercise even if you haven't finished the previous one - the only difference is that some of your generated documentation will be different)

Understanding

  • This folder contains a simple project that could do with some documentation.

Exercises

Getting started with Sphinx

  • Ensure that you have Sphinx installed for your system
  • Create a docs folder within the average_squares folder - this is where your documentation for the project will be stored
  • From within the docs folder run sphinx-quickstart to generate the scaffolding files that Sphinx needs
    • Ensure that you select no for Separate source and build directories - this should be the default, but if chosen incorrectly it will mean your folder structure won't match the instructions below.
    • You can accept the defaults and enter sensible information for the other fields.
  • Run sphinx-build . _build/html or make html to generate html docs.
  • Run python -m http.server -d _build/html/ and open the link shown by this command to see the built documentation in a browser.

Modifying index.rst

  • Open the index.rst file - this is the master document that serves as the entrypoint and welcome page to the documentation.
  • Add a line or two about the purpose of the project
  • Save and rebuild the documentation - verify that it builds correctly

Adding content and structure

  • In the docs folder create a subfolder called content.
  • Within docs/content create a file called average-squares-docs.rst with the following contents:
Average Squares Documentation
=============================
  • Update the toctree directive in index.rst so that this new file is included.
  • Rebuild the documentation and verify that this file is now linked to.

Using Docstrings to create documentation

As you saw in the previous exercise (#23), the code in this project contains some docstrings - let's show them in our Sphinx-generated documentation.

  • Follow the instructions on the Sphinx getting started page to enable the autodoc extension.
  • Can you modify the content/average-squares-docs.rst file to include docstrings from the code automatically? (See the sketch after this list.)
  • Hint: You may find it useful to modify the path setup in docs/conf.py in the following way so it is easier for Sphinx to find the location of the code
# -- Path setup --------------------------------------------------------------

import os
import sys
sys.path.insert(0, os.path.abspath('..'))
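
One way to answer the question above is to add an autodoc directive to content/average-squares-docs.rst (the module name squares is assumed):

.. automodule:: squares
   :members: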

Updating your PR

Commit the changes to your branch, updating the PR you created in the previous exercise. Add a comment with Answers UCL-MPHY0021-21-22/RSE-Classwork#24

Explore further features of Sphinx

There are many additional features of Sphinx - explore them if you have time.

Profiling code

Even when we measure the total time that a function takes to run (#46), that doesn't help us with knowing which parts of the code are slow!

To look into that, we need to use a different tool called a profiler. Python comes with its own profiler, but we will use a more convenient tool.

Setup

This exercise will work with IPython or Jupyter notebooks, and will use two "magic" commands available there. You may need some steps to set them up first.

If you use Anaconda, you should already have access to Jupyter. If you don't, let us know on Moodle or use pip install ipython to install IPython.

The %prun magic should be already available with every installation of IPython/Jupyter. However, you may need to install the second magic (%lprun).
If you use Anaconda, run conda install line_profiler from a terminal. Otherwise, use pip install line_profiler.

Using profiling tools in IPython/Jupyter notebook

The %prun magic gives us information about every function called.

  1. Open a Jupyter notebook or an IPython terminal.
  2. Add an interesting function (from Jake VanderPlas's book)
    def sum_of_lists(N):
        total = 0
        for i in range(5):
            L = [j ^ (j >> i) for j in range(N)]
            # j >> i == j // 2 ** i (shift j bits i places to the right)
            # j ^ (j >> i) -> bitwise exclusive or: a bit of j stays the same where the shifted value is 0, and flips where it is 1
            total += sum(L)
        return total
  3. run %prun:
    %prun sum_of_lists(10_000_000)
  4. Look at the table of results. What information does it give you? Can you find which operation takes the most time? (You may find it useful to look at the last column first)

Using a line profiler in IPython/Jupyter

While %prun presents its results by function, the %lprun magic gives us line-by-line details.

  1. Load the extension on your IPython shell or Jupyter notebook
    %load_ext line_profiler
  2. Run %lprun
    %lprun -f sum_of_lists sum_of_lists(10_000_000)
  3. Can you interpret the results? On which line is most of the time spent?

Finishing up

When you are done, react to this issue using one of the available emojis, and/or comment with your findings: Which function takes the most time? Which line of the code?

Creating a ๐Ÿ๐Ÿ“ฆ

Help Charlene turn her repository into a package (remember to commit after each step).

  1. Choose who in your team is sharing now! (make sure you've got a fork and a local copy of Charlene's repository)

  2. Add a .gitignore file to the repository to avoid adding artefacts created by python or your text editor. You can use gitignore.io to generate a file for your needs.

  3. Modify the repository directory structure to make sagittal_average an installable package (don't forget to add empty __init__.py files; there is no need to add the .md files yet!). A sketch of the resulting layout appears after this list.

  4. Add a setup.py file with the information needed.

  5. Try to install it by running pip install -e . from the command line, in the filesystem location where the setup.py is.

  6. Share your solution as a pull request to Charlene's repository mentioning this issue (by including the text Addresses UCL-MPHY0021-21-22/RSE-Classwork#36 in the pull request description). Remember to mention your team members too (with @github_username)!

  7. Congratulations, you've created a Python package! Now, let's see what can be improved about it in the subsequent issues!
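
For step 3, the layout might end up looking roughly like this (a sketch; only the files mentioned in the exercise are shown):

sagittal_average/          # repository root, where setup.py lives
├── .gitignore
├── setup.py
└── sagittal_average/      # the installable package
    ├── __init__.py
    └── sagittal_brain.py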
