librarycarpentry / lc-open-refine

Library Carpentry: OpenRefine

Home Page: https://librarycarpentry.org/lc-open-refine/

License: Other

carpentries library-carpentry lesson openrefine data-cleaning data-management english stable

lc-open-refine's Introduction

Maintainers for Library Carpentry: OpenRefine

Jennifer Stubbs (Lead)
Owen Stephens

Past Maintainers for Library Carpentry: OpenRefine

Lesson Maintainers communication is via the team site.

Library Carpentry

Library Carpentry is a software and data skills training programme for people working in library- and information-related roles. It builds on the work of Software Carpentry and Data Carpentry. Library Carpentry is an official Lesson Program of The Carpentries.

License

All Software, Data, and Library Carpentry instructional material is made available under the Creative Commons Attribution license.

Contributing

There are many ways to discuss and contribute to Library Carpentry lessons. Visit the lesson discussion page to learn more. Also see Contributing.

Code of Conduct

All participants should agree to abide by The Carpentries Code of Conduct.

Authors

Library Carpentry is authored and maintained through issues, commits, and pull requests from the community.

Citation

Erin Carillo (Ed.), Owen Stephens (Ed.), Juliane Schneider (Ed.), Paul R. Pival (Ed.), Kristin Lee (Ed.), Carmi Cronje (Ed.), James Baker, Christopher Erdmann, Tim Dennis, mhidas, Daniel Bangert, Evan Williamson, … Jeffrey Oliver. (2019, July). LibraryCarpentry/lc-open-refine: Library Carpentry: OpenRefine, June 2019 (Version v2019.06.1). Zenodo. http://doi.org/10.5281/zenodo.3266144

Checking and Previewing the Lesson

To check and preview a lesson locally, see http://carpentries.github.io/lesson-example/07-checking/index.html.

lc-open-refine's People

Forkers

okstate-maps pitviper6 mialondon cmacdonell stragu jbkieffer kirschbombe abubelinha shlake anelda ucla-data-science-center brownsarahm danielbrett celinergb ingridreiche danielbangert jmjamison rachelwritingcode tom-h coxsh partiecolored bmcgowan01 libcce adivea joshuadull dromito morskyjezek ppival hayesb kaitlinnewson fionaglasgow sm4bes tibhannover samuelhansen katrinleinweber reallinna andreamedinasmith liamodwyer hannahcgunderman mikaelalawrence elliewix jendaub drjwbaker shilowil rootsandberries eunicexc biostew carleneb rdirig bellsm74 sophieelisabeth esalmon79 mervlim21 metadata-research sclayton29 karenword jreeveseyre sakeogh mikeadavidson cataloguelegacies carpentries-i18n gcorinne dipietroc jgiaccai melmeum tadamus haozeke abigailsparling cclaridge swc-kcl-london-03-2021 alexcasper lwhit528 airyaa ahfbacon6 villanueval ej2432 jensnow1 jwatts-pixel gengrubber klbarnes20 emcaulay libcarpenter mahunter14 gbstringer jessicahymers oelker mfeustle riannz curet nachi0310 lahm3d grynoch apandas fsteeg davidfkane mirjamc floresfabiana debalinabarua rcurty saragon02

lc-open-refine's Issues

Describe differences between create project, open project and import project

I noticed that the lesson doesn't have any descriptions for learners on what might be the differences between create a project, open a project and import a project. There is one brief line further down indicating that "open a project" allows you to review your existing projects. Since open, import and create can have similar definitions in other tools it might be useful to spend a little more time here distinguishing these tabs.

Review Episode 12 against Carpentries style guide

Carpentries now has a style guide
https://docs.carpentries.org/topic_folders/communications/style-guide.html

Review Episode 12 (https://github.com/LibraryCarpentry/lc-open-refine/blob/gh-pages/_episodes/12-export-transformation.md) to ensure it follows the style guide

Review Episode 2 against Carpentries style guide

Carpentries now has a style guide
https://docs.carpentries.org/topic_folders/communications/style-guide.html

Review Episode 2 (https://github.com/LibraryCarpentry/lc-open-refine/blob/gh-pages/_episodes/02-importing-data.md) to ensure it follows the style guide

Include language about how OpenRefine opens on different Operating Systems?

Could the lesson include language about how launching OpenRefine opens a terminal window on PCs but not on Macs? (I've not seen OpenRefine launched on a Linux machine, so I don't know whether a terminal window appears there.)

Create a GREL cheatsheet

Create a cheatsheet-like handout for GREL functions with examples.

This should include at least all GREL operations used in the Lesson
The desired end result is a cheat sheet such as

Resources in the OpenRefine wiki may be helpful in compiling the guide and also may be worth pointing to from the cheat sheet. In particular:

OpenRefine 3.2

OpenRefine 3.2 was released on Friday.

I have worked through the lesson side by side in 3.1 and 3.2 and cannot see any differences. Even advanced functions (fetching URLs from CrossRef, VIAF reconciliation, the vib-bits extension) work the same way as before. I suggest changing the "latest tested version" in https://github.com/LibraryCarpentry/lc-open-refine/blob/gh-pages/setup.md to 3.2.

OpenRefine 3.2 introduces some new features (cf. https://github.com/OpenRefine/OpenRefine/wiki/Changes-for-3.2), but in my view there are no "game changers", and it is better to maintain the backwards compatibility of the lesson for now.

Add Episode on Dates in OpenRefine

Based on request by @rovinghistorian in #10
@rovinghistorian said:

Would there be time in lesson 10--or would it be better as part of an advanced set of lessons--to discuss handling transforming messy dates? (My use case: a spreadsheet inventory with a column for dates, creators added data like 2/xx/1954 or 19?? in addition to 2/3/1934 and other data that was easier to transform/clean up.)

@ostephens replied

I think a lesson on "Dates in OpenRefine" that could be optionally included would be very useful. It would be interesting to sketch out what could be in that, but I would think about including:

Using toDate and toString

Using 'match' to work with messier dates

Date facets

Date specific GREL functions: datePart, diff, inc, now
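
As a rough illustration of the 'match' idea for messier dates, here is a minimal sketch in Python (not GREL, and not part of the lesson). The function name normalise_date, the month/day/year reading of values such as 2/3/1934, and the output labels are all assumptions made for this example; values like 19?? are deliberately left untouched for manual review.

```python
import re
from datetime import date

# Assumed input shapes, taken from the examples in the request above.
FULL_DATE = re.compile(r"(\d{1,2})/(\d{1,2})/(\d{4})")   # e.g. 2/3/1934
MONTH_YEAR = re.compile(r"(\d{1,2})/[xX?]+/(\d{4})")     # e.g. 2/xx/1954


def normalise_date(value):
    """Return a (kind, text) pair; anything not recognised stays as-is."""
    text = value.strip()

    m = FULL_DATE.fullmatch(text)
    if m:
        # Assumption: dates are written month/day/year.
        month, day, year = (int(g) for g in m.groups())
        try:
            return ("full", date(year, month, day).isoformat())
        except ValueError:
            # Impossible calendar date, e.g. 13/40/1999.
            return ("unparsed", text)

    m = MONTH_YEAR.fullmatch(text)
    if m:
        month, year = int(m.group(1)), int(m.group(2))
        if 1 <= month <= 12:
            return ("month-only", f"{year:04d}-{month:02d}")

    return ("unparsed", text)
```

Keeping the unparsed values visible, rather than guessing at them, mirrors what a facet on the result column would let you do in OpenRefine: review the leftovers by hand.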

In Lesson 11 (arrays) the Custom Facet needs to explicitly convert the boolean with .toString()

Reported by @kristindawn in #30

For OpenRefine 3.1 upwards list/text facets don't automatically convert numbers, dates or booleans to strings for the facet. This means the conversion has to be done explicitly by the user when they want to see (e.g.) booleans listed as true/false rather than just getting a count of the number of boolean values.

Lesson needs updating to use

value.contains(",").toString()

Update Episode 13 to include how to set User-Agent header when looking up data on CrossRef

The CrossRef API etiquette guidance (https://github.com/CrossRef/rest-api-doc#etiquette) says users of the API should:

Specify a User-Agent header that properly identifies your script or tool and that provides a means of contacting you via email using "mailto:". For example: GroovyBib/1.1 (https://example.org/GroovyBib/; mailto:[email protected]) BasedOnFunkyLib/1.4.

We should make setting the User-Agent header when fetching data with OpenRefine part of this lesson
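
To make the shape of such a header concrete, here is a minimal sketch in Python rather than in OpenRefine itself. The tool name, homepage, email address and DOI used in the usage note are placeholders, and the helper names build_user_agent and crossref_request are invented for this example. The request object is only constructed, never sent.

```python
from urllib.parse import quote
from urllib.request import Request


def build_user_agent(tool, version, homepage, email):
    """Format a User-Agent value with a mailto: contact, as the etiquette asks."""
    return f"{tool}/{version} ({homepage}; mailto:{email})"


def crossref_request(doi, user_agent):
    """Prepare (but do not send) a lookup for one DOI on the CrossRef REST API."""
    url = "https://api.crossref.org/works/" + quote(doi)
    return Request(url, headers={"User-Agent": user_agent})
```

For example, build_user_agent("MyRefineLookup", "0.1", "https://example.org/tool", "me@example.org") yields a value in the same pattern as the GroovyBib example quoted above. How the header is actually set inside OpenRefine depends on the version in use and is not shown here.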

Review Episode 13 against Carpentries style guide

Carpentries now has a style guide
https://docs.carpentries.org/topic_folders/communications/style-guide.html

Review Episode 13 (https://github.com/LibraryCarpentry/lc-open-refine/blob/gh-pages/_episodes/13-looking-up-data.md) to ensure it follows the style guide

Episode 9: Transformations - Undo and Redo

Episode 9

To:

If you wish to save a set of steps to be re-applied later, or to a different project, you can click the Extract button. This gives you the option to select the steps you want to save, and to copy the transformations included in the selected steps in a format called ‘JSON’

Suggested addition:

You can copy the extracted JSON and save it as a simple text file (e.g. in Notepad) for later re-use.
(Tip: save with a filename that helps you remember the steps involved, for example authorSplitANDmassEditPublisher.txt)

Motivation for suggested addition: it is not clear now how to save selected steps, and it also might give the impression that the file has to be saved as xxx.json

Review Episode 7 against Carpentries style guide

Carpentries now has a style guide
https://docs.carpentries.org/topic_folders/communications/style-guide.html

Review Episode 7 (https://github.com/LibraryCarpentry/lc-open-refine/blob/gh-pages/_episodes/07-introduction-to-transformations.md) to ensure it follows the style guide

Review Episode 9 against Carpentries style guide

Carpentries now has a style guide
https://docs.carpentries.org/topic_folders/communications/style-guide.html

Review Episode 9 (https://github.com/LibraryCarpentry/lc-open-refine/blob/gh-pages/_episodes/09-undo-and-redo.md) to ensure it follows the style guide

Episode 11: Fix minor typo

The second point of the match part of the exercise appears to have a typo:

In the Expression box type value.match(/(.*),(.*)/) The /, means you are using a regular expression inside a GREL expression. The parentheses indicate you are going to match a group of characters. The .\* expression will match any character...

I think the .\* should actually read .*
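
For readers who want to see what the pattern (.*),(.*) captures, here is a small sketch using Python's re module as a stand-in for GREL. It assumes that the expression has to match the whole cell value and that the result is the list of captured groups. The function name match_groups is invented for this example.

```python
import re

# Two groups separated by a comma; .* matches any run of characters.
PATTERN = re.compile(r"(.*),(.*)")


def match_groups(value):
    """Return the captured groups as a list, or None if the value has no comma."""
    m = PATTERN.fullmatch(value)
    return list(m.groups()) if m else None
```

Because the first .* is greedy, a value with several commas is split at the last one, so match_groups("a,b,c") gives the groups "a,b" and "c".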

June 2019 Lesson Release checklist

If your Maintainer team has decided not to participate in the June 2019 lesson release, please close this issue.

To have this lesson included in the 18 June 2019 release, please confirm that the following items are true:

When all checkboxes above are completed, this lesson will be added to the 18 June lesson release. Please leave a comment on carpentries/lesson-infrastructure#26 or contact Erin Becker with questions ([email protected]).

Episode 6: Working with columns and sorting

Kia ora,

As part of the checkout process I propose a minor change to Chapter 6: Working with columns and sorting.

I suggest a minor change of wording relating to re-order vs reorder:

You can re-order the columns by clicking the drop-down menu at the top of the first column (labelled ‘All’), and choosing Edit columns->Re-order / remove columns …

You can then drag and drop column names to re-order the columns, or remove columns completely if they are not required.

to be replaced by:

You can reorder the columns by clicking the drop-down menu at the top of the first column (labelled ‘All’), and choosing Edit columns->Re-order / remove columns …

You can then drag and drop column names to reorder the columns, or remove columns completely if they are not required.

Motivation for suggested change:

There are two variations on "reorder" in use on this page, which, while minor, is inconsistent.

It might be worth having a discussion to see which way it goes, i.e. the "reorders" could be changed to "re-orders", given that that is how it is spelled in OpenRefine, even though I don't believe that is technically correct. The remainder of the lesson uses reorder, for what it's worth.

Fiona Glasgow
Research Support Librarian
University of Otago

Improving Column Split exercise for OpenRefine

In preparing for the second day of instructor training, I went through the OpenRefine lesson on faceting using the portal_rodents dataset and found two places where the lesson could maybe improve:

Where you split the Scientific Name column on a space, I got confused with the instructions initially and (not thinking) thought I was supposed to add a space in addition to the comma (it was 9:30 pm, my only excuse). I figured it out when it didn’t work (I had to use the Undo/Redo screen earlier than the lesson called for), but when I taught it to my partner the next day, that experience made me emphasize that one could split the column on any character: the default is a comma, but you can change that to a space.
Later in the lesson when it called for trying to rename the column, the note says that students won't succeed and prompts them to think about why. I actually did succeed in renaming mine because I capitalized the word Species instead of leaving it lowercase. So the follow up discussion might benefit from asking, not “why didn’t that work”, but instead, “Did you get that to work? Can you think of a reason why it might work versus not?”

Using GREL for names reversing

Lots of discussion on this here:

data-lessons/library-openrefine-DEPRECATED#64

Update setup.md to remove mention of older versions of OpenRefine

In setup.md

Remove mentions of OpenRefine 2.x versions
Remove information about using google-refine.exe which is not valid after v2.5
Add a direct encouragement to upgrade to the tested version

Episode 3: Layout of OpenRefine, Rows vs Records

Hi,
I am Joakim Philipson from Stockholm University. As a prospective Carpentries instructor, this is my first issue as part of the checkout process:

Episode 3

Suggest a minor change of wording here:

You will now see that split rows have gone away - the Authors have been joined into a single cell with the specified delimiter. Our Rows and Records values will now be the same since we do not have any more ~~split columns~~.

to be replaced by:

... columns with split (multi-valued) cells

Motivation for suggested change:

Reference back to previously used expression, (multi-valued) might be left out
Avoid possible confusion and misunderstanding that columns, not cells, were split before

Review Episode 11 against Carpentries style guide

Carpentries now has a style guide
https://docs.carpentries.org/topic_folders/communications/style-guide.html

Review Episode 11 (https://github.com/LibraryCarpentry/lc-open-refine/blob/gh-pages/_episodes/11-using-arrays-transformations.md) to ensure it follows the style guide

Seeing different data in 'Put titles into Title Case'

In Writing Transformations > Put titles into Title Case, I saw something different than what is written in the exercise. I don't see the publisher selections in uppercase. Can someone confirm? As an alternative, we can include an additional step to Edit Cells > Common transforms > To uppercase (to make sure the publisher names are in uppercase)?

Review Episode 6 against Carpentries style guide

Carpentries now has a style guide
https://docs.carpentries.org/topic_folders/communications/style-guide.html

Review Episode 6 (https://github.com/LibraryCarpentry/lc-open-refine/blob/gh-pages/_episodes/06-working-with-columns.md) to ensure it follows the style guide

Lesson #04 Faceting and Filtering

Suggested improvements to the text of the Faceting and Filtering lesson of the Library Carpentry OpenRefine course:

• Question > How can I use filters and facets to explore data OpenRefine?
	○ Question > How can I use filters and facets to explore data **in** OpenRefine?
• Facets are one of the most useful features of OpenRefine and can help both get an overview of the data in a project as well as helping you bring more consistency to the data.
	○ Facets are one of the most useful features of OpenRefine and can help **in** both gett**ing** an overview of the data in a project as well as helping you bring more consistency to the data.
• A ‘Facet’ groups all the values that appear in a column, and then allow you to filter the data by these values and edit values across many records at the same time.
	○ A ‘Facet’ groups all the values that appear in a column, and then allow**s** you to filter the data by these values and edit values across many records at the same time.
• Correct the Language values via a facet > Text facet on the language column and correct the variation in the EN and English values.
	○ Correct the Language values via a facet > **Use the** Text facet on the language column and correct the variation in the EN and English values.

Lesson #4 - Faceting and Filtering

I noticed that this lesson starts with facets, briefly touches on filters, and then returns to facets. For the sake of continuity and grouping like concepts together, I thought the page might benefit from a slight reordering:

Facets
Let's create a text facet
Facet exercise (Which licenses are used for articles in this file?)
More on facets
Facet exercise (Find all publications without a DOI)
Filters
Working with filtered data

I’d also like to suggest changing “More on Facets” to “Going beyond the text facet”.

Confusing to have lowercase contribute.md and CONTRIBUTING.md in repo

Could someone look at both files and see if one can be deleted?

Episode 6: Working with columns and sorting - no information on renaming columns

The episode overview, key points and one of the headings all mention renaming a column, but the episode doesn't actually contain any information on how to do this.

I would suggest adding something like this:

You can rename a column by opening the drop-down menu at the top of the column and choosing Edit column -> Rename this column. You will then be prompted to enter the new column name.

Layout of OpenRefine, Rows vs Records - separator example

In https://librarycarpentry.org/lc-open-refine/03-working-with-data/index.html

“### Choosing a good separator
Now imagine that the document creator had chosen a comma as the separator instead of a pipe.

Jones, Andrew , Davis, S.

Can you spot the problem? Can you tell where one author stops and the next begins?”

It seems to me that a better example is possible, because a plausible answer to the question “Can you spot the problem?” would be: ‘No’. To the question “Can you tell where one author stops and the next begins?” my answer would be: ‘Yes, I can: the division between two author names is a blank, followed by a comma.’
My suggestion would be:

“Now imagine that the document creator had chosen a blank as the separator instead of a pipe.

Jones, Andrew Davis, S.

Can you spot the problem? Can you tell where one author stops and the next begins?”
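
The ambiguity can also be shown mechanically. The sketch below is plain Python, not anything OpenRefine does internally, and the helper name split_authors is invented for this example. It splits the same two authors once with a comma as separator and once with a pipe.

```python
def split_authors(cell, separator):
    """Split a multi-valued cell on the separator and trim surrounding spaces."""
    return [part.strip() for part in cell.split(separator)]


comma_separated = "Jones, Andrew , Davis, S."
pipe_separated = "Jones, Andrew | Davis, S."

# A comma is also used inside each name, so the split cannot tell the
# name-internal commas from the one that separates the two authors.
comma_parts = split_authors(comma_separated, ",")

# A pipe does not occur inside the names, so each author stays whole.
pipe_parts = split_authors(pipe_separated, "|")
```

Splitting on the comma produces four fragments for two authors, while splitting on the pipe produces exactly two values. A human reader can use the extra space before the comma as a hint, which is the point made above, but a simple split on the separator cannot.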

Review Episode 8 against Carpentries style guide

Carpentries now has a style guide
https://docs.carpentries.org/topic_folders/communications/style-guide.html

Review Episode 8 (https://github.com/LibraryCarpentry/lc-open-refine/blob/gh-pages/_episodes/08-writing-transformations.md) to ensure it follows the style guide

How do you start OpenRefine?

In this lesson it says:

To import the data for the exercises below, run OpenRefine. NOTE: If OpenRefine does not open in a browser window, open your browser and type the address http://127.0.0.1:3333/ to take you to the OpenRefine interface.

We don't show anywhere how you actually start OpenRefine.

Lesson #3: Rows vs. Records with lengthy metadata files

Good morning!

My name is Madison Chartier. I'm a new hopeful instructor with the Carpentries and look forward to being a part of this group.

I'm presently doing a lot of work in my job with OpenRefine. It's a great tool (I love it!), but I've run across some issues that I was curious whether or not might be beneficial to address or acknowledge in Library Carpentry lessons as potential troubleshooting areas when working with lengthy, messy metadata files on OpenRefine. One issue in particular is the concept of "Rows" and "Records," as discussed in Lesson #3 of the Library Carpentry: OpenRefine session.

The CSV files I often clean on OpenRefine are set up to contain one record per row. However, when uploading more extensive files (say, with 10,000 entries), I've noticed that OpenRefine tends to glitch or misread some of the records, mistaking some rows as being related when they are meant to be their own unique entries. In such instances, the "Records" view winds up being misleading, and I'm often obliged to ignore it. I end up relying exclusively on the "Rows" view, which tends to reflect the more accurate count of entries from the start.

Has anyone else encountered this issue? If so, is there an explanation for it? Might it be something we want/need to address when working on OpenRefine with the particular kind of data we're often obliged to work with in libraries? I'd be curious to know your thoughts.
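
One possible explanation, offered as a sketch rather than a diagnosis: as I understand OpenRefine's records mode, a row whose first (key) column is blank is treated as belonging to the record above it. If some rows in a large file have an empty first cell, the record count will be lower than the row count. The Python below imitates that grouping rule. The function name group_into_records is invented, and the exact definition of "blank" used here (None or an empty string) is an assumption.

```python
def group_into_records(rows, key_index=0):
    """Group rows into records: a blank key cell attaches the row to the record above."""
    records = []
    for row in rows:
        key = row[key_index]
        is_blank = key is None or key == ""
        if is_blank and records:
            # Continuation row: same record as the previous row.
            records[-1].append(row)
        else:
            # Non-blank key (or very first row): start a new record.
            records.append([row])
    return records
```

With four rows of which one has an empty first cell, this returns three records, which would match the kind of mismatch described above. Moving a column that is never empty (for example an identifier) to the first position is one thing to try if that turns out to be the cause.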

Thanks!

Regards,
Madison

Review Episode 1 against Carpentries style guide

Carpentries now has a style guide
https://docs.carpentries.org/topic_folders/communications/style-guide.html

Review Episode 1 (https://github.com/LibraryCarpentry/lc-open-refine/blob/gh-pages/_episodes/01-introduction.md) to ensure it follows the style guide

Reference page pointer no longer going where originally intended

In this page: https://librarycarpentry.org/lc-open-refine/guide/

The link originally given as:
https://my.datascientistworkbench.com/

Now goes to:
https://labs.cognitiveclass.ai/login

Which doesn't look like it's meant for the same sort of thing.

Review Episode 3 against Carpentries style guide

Carpentries now has a style guide
https://docs.carpentries.org/topic_folders/communications/style-guide.html

Review Episode 3 (https://github.com/LibraryCarpentry/lc-open-refine/blob/gh-pages/_episodes/03-working-with-data.md) to ensure it follows the style guide

"Make sorts permanent" no longer an option in open refine 3.1

I don't know if this is true for OpenRefine 3.2 beta, but for OpenRefine 3.1 there is no option in the sort menu to make sorts permanent. Where would be the best place to note this: in the Working with columns and sorting section, or in the instructor notes?

Add step to Importing Data into OpenRefine episode

In the "Importing Data into OpenRefine" episode, the "Create your first OpenRefine project (using provided data)" exercise does not include a step to retitle your project. OpenRefine will automatically name it based on the file name, which may or may not be useful.

I did my teaching demo last week as part of the instructor checkout process, and actually hadn't done the lesson contribution yet, which was great in that it allowed me to go back and notice that this was actually missing from the exercise. The Carpentries staff member who led my teaching demo session commented on my not renaming the project as I demo'd this lesson, and I had assumed I missed a step, only to find it's not there.

I've attached a screenshot with the Project name field circled in blue to distinguish it from the red circles on the current version of the lesson. I'd recommend adding a red circle for this section too, and adding a step to the numbered exercise steps. Maybe make this step 7?

Handout creation?

History here: data-lessons/library-openrefine-DEPRECATED#60

When to use OpenRefine

See data-lessons/library-openrefine-DEPRECATED#86 for when to use OR and when not to

New OpenRefine versions in setup.md

The lesson recommends OpenRefine 2.7, but the OpenRefine download page has versions 2.8 and 3.0 only. I'm sure a student could hunt down 2.7, and perhaps update the setup.md link with a download page that has the older version, but it's probably best to keep this lesson updated with the latest versions as they come out.

I can test the lesson in version 3.0 and alert you to any problems I come across.

Review Episode 10 against Carpentries style guide

Carpentries now has a style guide
https://docs.carpentries.org/topic_folders/communications/style-guide.html

Review Episode 10 (https://github.com/LibraryCarpentry/lc-open-refine/blob/gh-pages/_episodes/10-data-transformation.md) to ensure it follows the style guide

Update Episode 13 to explain Throttle delay

Episode 13 mentions that "API providers may impose rate limits" (thanks @dromito) but does not show how to respect these limits in OpenRefine by using the "Throttle Delay" setting.

We should add an explanation of this since it isn't at all obvious that this is the same as a rate limit from the terminology used in OR.
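
To illustrate why a delay between requests amounts to a rate limit, here is a minimal Python sketch. It is not how OpenRefine implements the setting. The function name fetch_with_throttle and the 5000 ms default are illustrative assumptions, and the fetch argument stands in for whatever actually retrieves a URL.

```python
import time


def fetch_with_throttle(urls, fetch, throttle_delay_ms=5000, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first one.
            sleep(throttle_delay_ms / 1000.0)
        results.append(fetch(url))
    return results
```

With a delay of d milliseconds, at most about 1000 / d requests are sent per second, so a provider limit of N requests per second is respected by choosing a delay of at least 1000 / N milliseconds. Passing sleep as a parameter is only there so the pause can be replaced in a test.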

Review Episode 5 against Carpentries style guide

Carpentries now has a style guide
https://docs.carpentries.org/topic_folders/communications/style-guide.html

Review Episode 5 (https://github.com/LibraryCarpentry/lc-open-refine/blob/gh-pages/_episodes/05-clustering.md) to ensure it follows the style guide

Suggestion for Episode 2 Importing data into OpenRefine

This is a suggested change to a lesson as part of checkout as a new instructor. https://librarycarpentry.org/lc-open-refine/02-importing-data/index.html
It took time to understand the import conditions for numbers 4-6 and it took me a while to find an explanation for number 6. Also, not including “Use character “ to enclose cells containing column separators” did trip me up in the teaching demo. I didn't take notice of this condition while teaching myself OpenRefine.

Current instructions
4. Click in the Character encoding box and set it to UTF-8
5. Ensure the first row is used to create the column headings by checking the box Parse next 1 line(s) as column headers
6. Make sure the Parse cell text into numbers, dates, ... box is not checked, so OpenRefine doesn’t try to automatically detect numbers

Suggested changes to make the above statements specific to the demo dataset.
4. Click in the Character encoding box and set it to UTF-8 as it will ensure the special characters in the Author column are displaying correctly
5. Ensure the first row is used to create the column headings by checking the box Parse next 1 line(s) as column headers
6. OpenRefine will automatically select “Use character “ to enclose cells containing column separators” as this will place data in one cell where the values are enclosed in quotes from the source dataset
7. Make sure the Parse cell text into numbers, dates, ... box is not checked, so OpenRefine doesn’t try to automatically detect numbers as it may cause errors such as confusion between American and British date formats
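
The effect of the quote-character option in suggested step 6 can be reproduced outside OpenRefine with Python's csv module. This is only an analogy for what the import option does. The sample line and the function name parse_line are made up for this example.

```python
import csv
import io

# A made-up line in which the author cell contains the column separator.
SAMPLE_LINE = '"Jones, Andrew",2019\n'


def parse_line(line, honour_quotes):
    """Parse one CSV line, with or without treating " as an enclosing character."""
    if honour_quotes:
        reader = csv.reader(io.StringIO(line), quotechar='"')
    else:
        # QUOTE_NONE tells the reader to give quote characters no special meaning.
        reader = csv.reader(io.StringIO(line), quoting=csv.QUOTE_NONE)
    return next(reader)
```

With quotes honoured, the author stays in a single cell and the line has two columns. With quoting switched off, the comma inside the name is read as a column separator and the same line splits into three cells.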

Thanks

Find issues to resolve for this lesson

This lesson has now been migrated to the Library Carpentry organisation. Work should happen on THIS repo.

All the issues raised for this lesson are still on the old repo at https://github.com/data-lessons/library-openrefine-DEPRECATED/issues

Find issues to resolve there and fix them here.
