epiverse-trace / blueprints
Software development blueprints for epiverse-trace
Home Page: https://epiverse-trace.github.io/blueprints
License: Other
pkgdown offers 4 development modes:
- auto
- release
- devel
- unreleased
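For reference, the mode is selected through the `development` field of a package's `_pkgdown.yml`. A minimal fragment (shown for illustration) looks like this:

```yaml
# _pkgdown.yml
development:
  mode: auto   # one of: auto, release, devel, unreleased
```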
At the moment, Epiverse-TRACE packages use either `auto` or `release` (the default).
At the last development meeting, we discussed how `auto` could confuse users by providing two versions of the same site, possibly with conflicting information, or with information that doesn't match their installed version.
This led us to call for a poll here to try and harmonize the use of these development modes across the project.
Here are the potential options we identified:
- Keep `auto`, but add visual elements (different colours, a banner, etc.) to clearly signal the difference between the release site and the development site.
- Use `devel` and only ever offer the development site, with a prompt for users to update to the latest version if a mismatch with their installed version is detected. Vignettes and documentation for their current version remain available locally via `help()` and `browseVignettes()`.
- Use `unreleased` (a single website) until the package is on CRAN, then switch to `auto` (release + dev websites). The rationale here is that we expect most users to install from CRAN, and it would be good to direct them to the frozen release site, with an option to manually switch to the dev website if they installed from GitHub (which suggests they are probably more comfortable with the R package lifecycle). This would also implement the first suggestion in this list.

As part of the task force to help improve decision-making around taking on dependencies, we have arrived at a checklist that could be used. The checklist essentially sums up many of the recommendations in the blueprints chapter on dependencies. I am going to post it here for further input and to discuss next steps. We could potentially merge it into that section, if deemed useful.
(`pak::pkg_deps_tree()`, for example) to understand the dependency tree?

What are the plans to document the results of each discussion in the issues section?
Since some of those discussions directly affect development, it would be nice to have a structured document that explains the adopted policies. Is this going to be part of the blueprints document? Should it be a separate one?
This checklist focuses on the aspects specific to the epiverse-trace project, and relevant to members from all institutions. Each partner institution is expected to provide their own specific onboarding process in addition to the points listed here:
I find it hard to read the text on the mindmap mainly because it uses light blue on a white background. In a well-lit room, this can be hard to read in a presentation slide, for example. Could you consider improving the contrast in the image by using darker colors for the text, for example?
Since there are many GPL R packages, the probability of having to reuse one of them is relatively high. In fact, we have already found some very useful packages that are GPL.
The blueprint states that we should aim for more liberal licenses (MIT or similar). However, if we want to take advantage of all existing packages, it might be better to use GPL.
For practical reasons I would prefer to go for GPL, but I would like to have an agreement on this matter: should the blueprint be changed to explicitly indicate that GPL is also an acceptable license for our packages? Or do we have compelling reasons to stay with liberal licenses and avoid using GPL packages?
This issue is the result of a conversation we started with @joshwlambert and @pratikunterwegs.
It would be useful to brainstorm about our views on dependencies in epiverse packages. We will not reinvent the wheel and some dependencies are a no-brainer. Other dependencies are not so clear-cut:
As @joshwlambert said, beyond subjective and stylistic preferences, trying to uniformise & standardise our stack can make us more efficient since we won't have to continuously switch between tools. @pratikunterwegs also mentioned that it would be problematic to introduce a dependency on a tool that only one person in the team really masters.
Relevant resources on this topic:
Hi everyone. We'd like to start a discussion on web scraping and data access for official websites, in our case from Colombia. Epidemiological data is stored on the SIVIGILA site, and it cannot be reached from some countries abroad (e.g. Canada) because a "connection timed out" error occurs. This example motivates us to think about some kind of local server/website to store the data (legal issues must be addressed), or to redirect queries and act as a VPN. We initially thought of preloading datasets within the package, but they are too large. What are your thoughts on this idea? Or how do you think we can ensure data access for potential users?
Feedback on the blueprints suggests our target audience is not well identified, and our blueprints may have a claim to universality which would make them 'compete' with similar guidelines from other initiatives. We should clarify that:
I just noticed that any member of this repo can edit a comment written by another. I think this is potentially problematic and should be changed.
I have a few questions around Issues / PRs that I don't think have been discussed yet (except partially with @Bisaloo in private):
Once clarified, these could be added to the contribution guide being edited in this PR.
https://github.com/epiverse-trace/blueprints/blob/main/principles.qmd#L69 is not formatted correctly, and thus does not provide a link when rendered to HTML.
We link out to various URLs across the blueprints, which is helpful. To ensure robust linking between different pages and sections in this repository, it would be helpful to remove hard-coded links in favor of relative links, where possible.
Examples include:
- Pages: `./code-review.qmd` instead of https://epiverse-trace.github.io/blueprints/code-review.html
- Sections: `#addressing-package-reviews` instead of https://epiverse-trace.github.io/blueprints/code-review.html#addressing-package-reviews
- The two can also occur in combination, where linking to a relative page with a section is needed (e.g., `./code-review.qmd#addressing-package-reviews`).
Making this consistent helps ensure links are robust and internal, and improves readability overall while editing.
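As a concrete sketch (reusing the code-review page and anchor mentioned above purely for illustration), the change in a Quarto source file would look like:

```markdown
<!-- Instead of a hard-coded absolute URL: -->
See the [code review chapter](https://epiverse-trace.github.io/blueprints/code-review.html#addressing-package-reviews).

<!-- Prefer a relative link, which Quarto resolves to the right HTML page at render time: -->
See the [code review chapter](./code-review.qmd#addressing-package-reviews).
```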
Is there a consistent long-form documentation structure we want to enforce in Epiverse? In other words, do we have a checklist that each package should comply with? For example, do we want to use pkgdown for every package, or should every package have at least one vignette? If so, should this be an introductory vignette covering the basic functionality? In terms of vignettes, I have previously found that the diversity of functionality and the intended audience (R beginners vs experienced R programmers) determine how many vignettes to write. Additionally, is there a recommendation on how we should write these? I have not yet used Quarto; if anyone has experience with it, it would be good to hear their thoughts and whether they recommend it. I do not have a personal preference on whether we standardise the long-form documentation.
This issue continues a discussion with @Bisaloo on Epiverse preferences for languages that are not R. The discussion began with C++ as the focus, as it is expected to be the main non-R language in the Epiverse packages (see e.g. finalsize). This is a non-exhaustive list, so please feel free to add to it, to rebut aspects of it, and to recommend solutions.
Code organisation
How should we organise the C++ source code that we write?
a. One aspect of this question is how much of a function should be in C++ vs being in R? While a good deal of R code can be re-written in C++ using libraries such as Eigen, this is often slower to write and requires specialist maintenance, possibly leading to it being less sustainable. Knowing these costs, what are the criteria for benefits (mostly speed improvements) that should be met before translating R code into C++ (or another language such as Julia)?
b. Another aspect is where the C++ code should live. While C++ files usually sit in `src/`, placing them in `inst/include` has been suggested by @BlackEdder in order to make the package usable as a header-only library by other Rcpp packages (if I understand correctly); this would be similar to Boost Headers. This would likely mean having an internal core function in a header, which is called by a wrapper function that is exposed to R and exported from the package. This is a bit more work, especially when it comes to thinking about the dependencies of future packages.
Code formatting
Which code formatting guide should we follow for C++ (and other languages)?
a. The Google C++ style guide seems to be a good shout, and is implemented by both cpplint and clang-format.
b. @Bisaloo has suggested MegaLinter as a cross-language formatting solution.
Other miscellaneous issues
a. Should we follow other conventions, such as having a copyright statement included in C++ files? Would it be sufficient to assume that the top-level MIT Licence covers this already?
b. Should we prefer certain versions of C++ (e.g. finalsize uses C++11), based on stability or other criteria?
Suggested by @joshwlambert: connecting these two sections with a link, or stating that "this continues there", can be an appropriate reminder for readers.
From line 50 (lines 48 to 52 in 4fcc6e6) to this section (lines 77 to 81 in 4fcc6e6).
Both internal and external links.
Follow up of #63.
For internal links, we could re-use the infrastructure from packagetemplate for pkgdown sites.
Our github organization is growing and it will be useful to firm up policies on roles we each endorse. Options are listed in this article.
As of now, the de-facto policy has been to make every PI who has joined the organization an owner, and every RSE a member. We end up with:
While I understand this may make sense at first glance, current roles reflect overall leadership and seniority in the project, rather than the actual roles used in the github organization.
From the description of permissions for the different roles, it seems we should have a majority of members, as they can create repositories, and perform most of the tasks needed (push code, handle PR, etc.). Owners should be restricted to administrative roles, mostly for organization-wide infrastructure, billing, and handling membership (which should stabilize in the weeks to come). Owners each have the ability to delete the entire organization, which is one of the reasons why we want to reduce the number of owners. We are working on a solution for making regular, automated backups of the whole organization, but this does not entirely remove the issue of having too many owners.
Are there any views on whether we should or should not use teams? As ideally people may contribute to a variety of projects, I am not sure if we need this, but curious to hear thoughts on the topic.
I would think we may not need to set up these roles, at least initially:
Given the above, I would suggest reducing the roles of owners to a small number of people handling administrative, security, and potentially billing tasks for the github organization. Default for contributors should be 'members', with the caveat that we need to set up workflows so that RSEs have all the autonomy they need for their work.
I appreciate this may be a sensitive topic so very keen to hear the thoughts of everyone. Please share!
We use the tidyverse style for package names, which applies no special formatting. This issue was mentioned by @Bisaloo in #44.
Please also double-check the formatting we use for package names on other pages, as I think we usually use the tidyverse style (i.e., no formatting).
I tried finding the exact rule in the tidyverse style guide before going through everything, but am unsure I found it. @Bisaloo, might you be able to point me at the original resource so I don't do double work?
Recently it was mentioned that the Epiverse-TRACE blueprints are more focused on process than on design. A lot of the discussions we've had around design have referred back to the Tidyverse design guide.
Is it worth adding a chapter (or just a few sentences) to blueprints, similar to the Code review chapter, that states we work in accordance with the design principles laid out by the Tidyverse and then mention any differences? Or perhaps design is outside the scope of blueprints.
Would be good to hear people's thoughts on this.
The contribution guide point 5 says to edit `index.qmd`, but I think it should be `principles.qmd`.
I start this thread to discuss how we can standardize our release process, following some conversations with @Bisaloo. It would be great to have everyone's input on this.
I can see two extremes:
I would advocate for 2, which is better suited to interacting with end-users and often more realistic: I have often found that sticking to plans made a long time ago is difficult, and/or that the original plan becomes less relevant as a project develops.
I would advocate using semantic versioning for package versions, so we can reflect the kind of changes made to a package with each new release. I personally only ever use the first three digits, and no pre-release component, as I find it just complicates things and I have never felt a need for it.
The status of the package is not entirely independent from versioning, but provides a high-level description of how 'reliable' the package will be for the user. @TimTaylor and I have described a lifecycle adapted from RStudio for RECON, summarized in this graph:
It is currently the one we use in some projects and on the general airtable database of the tool ecosystem. But we can use another one if we prefer to.
While issues can be broken down into tasks using tick boxes, I find issues are easier to tackle when they don't contain a long list of things to do. Ideally an issue could be closed by a single, not too extensive PR, so it is easier to review, and faster to merge.
I like using github projects to manage releases. The pros I can see are:
I like the idea that the `main` branch on GitHub remains functional at all times whilst holding the latest features and bug fixes, and CRAN hosts the 'stable' version. @Bisaloo mentioned there is a limit to the number and frequency of releases we can make on CRAN. I am not sure we want to set hard rules about when to submit a release to CRAN, as the decision may not be entirely driven by semantic change. For instance, a patch fixing an important bug may need to go to CRAN urgently.
First releases onto CRAN are a lot harder than subsequent updates. I would advise not only that the package be bulletproof when initially submitted, but also that it have a relatively stable API, so we avoid breaking backward compatibility with later updates. In terms of development stage, that would mean the package is probably towards the end of its experimental phase and approaching 'maturing'.
I think that covers it for me. Thoughts?
The figure added to the `principles.qmd` page is not rendering correctly for me on the website.
Reported by @joshwlambert in #79.
Sorry for missing this while reviewing, @Bisaloo!
This stems from the discussion we had with @thibautjombart.
We have done a mapping exercise of the public goods we can focus on to get started. These platforms/software provide data on contact tracing and case management. Access the sheet here.
There are various functionalities for exporting data from these tools. The purpose of this thread is to kickstart discussion on the various approaches we can undertake.
Accepting API keys/other auth credentials along with parameters (file type, filters, etc.) to pull data directly into the session.
This approach will give the required data with possibly less need for cleaning, but the auth approaches of the websites can change, resulting in failures. This approach is also limited to users with internet connectivity.
Accepting exported data files from platforms. All platforms give users the option of exporting the data as Excel files. Here is an example
We have an implied policy throughout the packages to provide thorough credits to contributors. This includes their role, but we also include ORCIDs where possible.
The blueprints do not state any recommendations on ORCID use. Do we want to explicitly recommend the use of ORCIDs throughout the Epiverse packages? This would help assign proper credit and disambiguate authors in the long run.
See also https://epiverse-trace.github.io/blueprints/contribution-acknowledgements.html
It would be easier to navigate the blueprints website if the left sidebar provided a deeper hierarchy. I notice that the table of contents on the right nav bar expands when a page is opened, but I would find it easier if it was all on the left side.
The information on code reviews in the current blueprints document is more high-level and lacks clear guidelines on the Epiverse-TRACE code review principles.
Given that Epiverse-TRACE conducts code reviews in a process similar to the Tidyteam's, and they happen to have a nice document outlining their principles and ways of working, we should link to it.
In addition to linking to the Tidyteam document, I think adding some further points of clarification would be useful, for example around previous confusion over whose responsibility it is to resolve conversations in code reviews. A few bullet points outlining common code review tasks that may be ambiguous without explanation would therefore help.
This issue continues a discussion with @Bisaloo on automating actions that improve the quality of code. This mostly means improving code readability by formatting code according to a style guide. Any thoughts on this, including choice of style guide, are welcome here.
An example of the actions intended is running `styler::style_pkg()`: this can be done manually every so often. However, it could also be automated so that each file is styled before any changes to it are committed. This prevents code-styling noise from cluttering the commit history.
One way of automating this process is by using git hooks. An alternative, or an extension, is a tool based on git hooks such as pre-commit. A git hook could include a code formatting step, so that code is properly formatted before committing.
We updated the pipeline for creating automated blog posts in epiverse-trace/epiverse-trace.github.io#237
This issue tracks whether we've updated the relevant text in the blueprints as previously added in #70
Under the section on code reviews, we mention the use of lintr and styler. It might be worthwhile to also mention goodpractice, which basically calls most of the other automations (covr, lintr, etc.) in one line of code and provides useful feedback for improving the code.
I can submit a PR to close this.
The new additions to the blueprints have been formulated from discussion taking place in weekly meetings between the research software engineers in Epiverse-TRACE, and therefore I think the contributors in the preface should include those attending those meetings. Alternatively, contributions could go to only those adding chapters to the blueprints, but this should at least be clarified.
Please consider linking the GitHub icon on the website back to this repo, as it currently leads to the general GitHub homepage (see line 25 in 056101a).
The goal of this discussion is to settle on a set of common practices around branches & merges in git/GitHub.
All changes should be done in branches and submitted by pull requests. Even in simple cases, where a review doesn't seem strictly necessary, using pull requests helps the rest of the team to notice the change and to stay up-to-date on the codebase.
- Should we enforce this by protecting the `main` branch?
At the same time, each pull request should focus on a single feature. Pull requests modifying multiple aspects of the code or fixing multiple unrelated bugs are usually more difficult and thus longer to review, which doesn't align well with our agile development method.
Pull requests should target the `main` branch. We do not believe it is necessary to have an intermediate `develop` branch. Because we follow an agile strategy, the stable version of the package corresponds to the latest CRAN release and the development version is the GitHub version.
However, we discourage using branches targeting other branches, as this can quickly escalate into difficult-to-manage conflicts. If you need to add or propose changes to a non-`main` branch, you should either use the commit suggestions feature from GitHub or, in more complex cases, agree with the person who opened the pull request to push directly to their branch.
GitHub currently has 3 pull request merge mechanisms:
The merge commit option is the only one that doesn't rewrite history. As such, it is the preferred option when multiple people commit directly to the same branch without prior coordination.
However, merge commits present other drawbacks, such as creating a heavily non-linear history, which is extremely difficult to read without a client displaying the various branches over time. Notably, it is very difficult to browse & understand a git history with merge commits in GitHub web interface.
For related reasons, merge commits can in some cases create flat-out unintelligible diffs in pull requests.
The preference of squash vs simple rebase depends on the quality of the commits in the branches to merge. If the commits are grouped logically, with clear commit messages, they can be rebased & merged directly. If not, then a squash & merge should be preferred.
- Should we remove the options we choose to not use from the list of options offered by GitHub?
- Should we encourage contributors to clean & reorganise their git commit history before merging their branches? This blog post gives a detailed explanation of why this can be beneficial, but it is important to note that it is not always straightforward, as reorganising commits can easily create conflicts if not done with care.
It is also important to note that our choice here also impacts how we should resolve diverging histories locally because the various methods are partially incompatible. For example, it is not possible to rebase & merge a branch that already contains a merge commit. Therefore, if we wish to use "Rebase & merge", we must ensure that collaborators do not create merge commits in their branches.
One common source of unintended merge commits is running `git pull` to fetch remote commits when you already have unpushed local commits. A simple way to avoid this is to run `git pull --rebase` instead of plain `git pull`. This behaviour can also be set globally with the following command:
git config --global pull.rebase true
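Running that command simply records the setting in your global git configuration (typically `~/.gitconfig`), which afterwards contains:

```ini
[pull]
	rebase = true
```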
GitHub has released a 4th merge mechanism (currently in private beta): merge queues.
This is meant to address one major issue that agile projects encounter. If your project has multiple concurrent pull requests, as soon as you merge one of them, checks in the others become out of date because they ran against an outdated version of the `main` branch. Merge queues solve this issue by creating a queue of all the pull requests you want to merge and running checks against a temporary branch containing all the changes in the queue.
This is likely the mechanism we will want to use in the future but it is currently in beta and detailed documentation is lacking.
The code chunks make the steps more visible. The two steps in the Full review section are both visible. However, in the Partial review section we only have one chunk.
In the lines below, we have two steps: one within the paragraph and the second in a code chunk (lines 40 to 47 in 9c74b34).
We can split this paragraph and compare with the current output. It could look like this:
A second use of partial code review is reviewing the changes between version releases. More generally, this can be considered as reviewing changes between a chosen branch and an arbitrary commit in the past, but for the purpose of this example we will focus on differences between versions. For this mock example, let's say a new version (v0.3.0) of a package is ready to be released and all the differences to the previously released version (v0.2.0) need to be reviewed. A branch, which we will call `v_020`, is created from the commit that is tagged with the v0.2.0 release. To find this commit we can run `git show-ref --tags`. This should return each commit SHA with its associated release tag. Then create a new branch from this commit using `git branch v_020 <commit_sha>` (replacing `<commit_sha>` with the chosen commit from the previous command). Push this branch with `git push origin v_020`.
git show-ref --tags
git branch v_020 <commit_sha>
git push origin v_020
We then want to create a branch from our stable branch (e.g. `main`) for the purpose of the review; here we will call it `review`.
git branch review
git push origin review
The pull request can now be made from the `review` branch to the `v_020` branch and will provide the difference between versions.
From the allRSE meeting of 2023-04-12:
Discussed in https://blog.r-hub.io/2022/09/12/r-dependency/.
Something that is not discussed in the blog post: the tidyverse is so omnipresent in the R ecosystem that they will likely force any user to update their R version.
This means that in practice, all potential users will use one of the latest 5 versions so we can just stick with this policy.
There are two things I would suggest to make explicit in the document:
- It would be a good idea to standardize the way we credit authors of code that we reuse. I would suggest simply adding a brief comment, just before the line where the reused code is inserted, saying something like "credits:", "reused from:", or "adapted from:", followed by the URL pointing to the original source. That way we can automatically create a document that summarizes all credits to external code.
- When reusing external code, ensure that its license is compatible with ours. For instance, if we chose MIT or a similar license for our projects, it may not be a good idea to reuse GPL code, since we would be compelled to change our license to GPL too. In general, I would avoid any code with licenses, whether open source or not, that are more restrictive than ours.
Is your feature request related to a problem? Please describe.
For tutorial contributors, I consult and redirect others to steps on how to rebase.
Describe the solution you'd like
Redirect from the CONTRIBUTING file to the blueprints for this step-by-step guideline after the committing branches comic.
Additional context
An alternative I considered was to redirect to a git-training discussion entry, but that would lack context.
Currently, users can't access this GitHub repository from the website. It would be good to provide GitHub on the nav bar to alleviate this, providing similar navigation as in our website.
In the last PR (#42) the default merging option is squash and merge. I do not have access to the repo settings, but would be useful to change this to rebase and merge, given past use of rebasing on other PRs.
Add a github issue to convert the blueprint document to html or pdf upon changes.