Content Capture_User manual_eMonocot
Processing published content for eMonocot Scratchpads

Content:
1. Introduction
2. Processing taxon descriptions in bulk
   a) Worked example
3. Adding individual taxon descriptions

1. Introduction

Published taxon descriptions from regional flora accounts, new species descriptions, and partial or complete monographs represent a valuable resource in biodiversity informatics. Digitising these for incorporation in a scratchpad can be a useful initial step in web-taxonomy. Permission to reproduce any published work needs to be sought from the publishers as appropriate.

The advantages of carrying out this work are:
- it assembles relevant treatments of a taxon in one place
- the information is available online as opposed to inaccessible in the specialist literature
- the information is searchable because it is stored in a structured database

Scratchpads can hold multiple descriptions of the same taxon, and the descriptions can be in different languages too.

Existing electronic resources:
- http://www.biodiversitylibrary.org/advsearch
- http://plants.jstor.org/ (look under Collection for digitised floras)
- www.eFloras.org
- www.kew.org/efloras

How you process species descriptions depends on the format of the descriptions, your access to equipment and software, and the number of taxa treated in the manuscript. If there are several descriptions in a manuscript that all follow the same format, it is probably worth processing them in bulk. If there are fewer species descriptions, it may be more efficient to copy and paste directly into the scratchpad. If you have an electronic copy of the manuscript, you can ignore the first few steps of the processing described below. Permission to reproduce may still need to be sought.

Please read through the whole of the document to get a better understanding of the principles involved at each step. The version of Excel used in the screenshots is Excel 2007. The principles should be adaptable to earlier versions.
See also the scratchpad wiki: http://help.scratchpads.eu/w/Add_and_export_taxon_descriptions, http://help.scratchpads.eu/w/Import. Please get in touch with the Content team or the Scratchpad team if you get stuck.

2. Processing taxon descriptions in bulk

Basic steps: hardcopy of the manuscript (book or journal article) → PDF → Word document → UTF-8 encoded text file → Excel file → TEMPLATE-import_into_taxon_description.

Workflow for processing taxon descriptions in bulk

Please note that Steps 2 and 5 are explained in more detail in the Worked example below.

If the manuscript is not available electronically, you need to scan the hardcopy. We scan and save pages individually as PDFs or JPEG images and then combine them into a single PDF file using Adobe Acrobat writer.

1. Scan & OCR: The scanned article needs to be converted into a machine-readable document. This process is called OCR (Optical Character Recognition). Adobe Acrobat writer has OCR functionality. Another OCR program (with a free trial version) is Abbyy FineReader, which comes with a user manual. Most modern PDFs are already machine-readable.

2. Edit: Copy and paste the text from the PDF into a Word document for proofreading and editing. The amount of proofreading depends on how clean the OCR was. Edit the Word document and add delimiters. This involves, firstly, proofreading and editing the text that goes into the scratchpad. Secondly, it involves adding the delimiters (e.g. #) that separate the text for different fields (e.g. Taxon name, General description, Habitat, etc.) in the scratchpad. See Worked example below.

3. Encode: Save the final Word document as plain text (*.txt), other encoding: Unicode (UTF-8). The result is a UTF-8 encoded text file, a format that can be imported into an Excel sheet.

4. Import:
In Excel, select ‘get external data from a text file’ and navigate to the text file you made. Import choosing the ‘delimited’ data type (select Next); set the delimiter to ‘Other:’ and add your chosen delimiter (i.e. #) in the box provided (select Next); select Finish; then select the first cell in the sheet as the place to put the data. The taxon name and the corresponding information on General description, Habitat etc. display in separate fields in the Excel sheet. This is then mapped into the scratchpad TEMPLATE. See Worked example below.

5. Map: Download a new copy of the TEMPLATE for taxon descriptions from your scratchpad. Then map the fields from the Excel sheet to the TEMPLATE as described in Worked example 2a below.

2a) Worked example

PDF with species descriptions. The text highlighted in blue is copied and pasted into a Word document for editing. Stages in the editing of the text are illustrated below in Box 1 (instructions 1-3) and Box 2 (final version of the text after editing).

Step 2. Edit the Word document and add delimiters.

Copy and paste the text from the PDF into a Word document and
1. correct spelling mistakes that have resulted from the OCR. In particular, check that the name of the taxon is spelled exactly the same as in the classification tree in your scratchpad (written out in full, with a single space between the genus name and the epithet).
2. delete any text you don’t need.
3. add delimiters between parts of the text that need to be displayed in different fields in the scratchpad (e.g. General Description, Habitat, Phenology). # works well as a field delimiter because this character doesn’t usually get used in the text. Add a different delimiter at the start of each new taxon (‘# works well); this is required to make each taxon a separate record in the Excel sheet later on (see below).
# field delimiter
‘# taxon delimiter

If there is a notes section, you may need to split the text up so the information ends up in the appropriate fields in the scratchpad. For example, a statement on how the species differs from another species can go in the Diagnostic description field, whilst the discussion on the name can go into Taxonomic notes. Distribution and habitat are often treated together. It’s best to split the statements relating to distribution from those on habitat.

4. complete a number of find and replace actions as listed in Table 1.

Table 1: Find and replace actions. The first three replacements need to be completed in the order they are listed below. The next two are optional replacements.

Find what | Replace with | Note
^p (paragraph mark) | Space | This is replacing paragraph marks with a space.
Double space | Single space | Repeat until there are no more replacements done.
‘# | ^p‘# | This makes every taxon into a different paragraph. When the final text file is imported into Excel this will be a different record.
x | × | e.g. 2 – 4 x 7 – 14. Tip: Include a space before and after the x and replace it with a space before and after the special character.
- (hyphen) | – (En dash) | e.g. 2-7. Ranges are commonly indicated by an en dash: 2 – 7.

66. ‘#Babiana scabrifolia Brehm ex Klatt in Abhandlungen der Naturforschenden Gesellschaft zu Halle 15: 349 (Erganzungen 15) (1882); G.J.Lewis: 68 (1959). Type: #South Africa, [Western Cape], Clanwilliam District, Langevallei, July 1830, JF Drege 2623 [K, l!;:cto., designated by Lewis, 1959: 68; P, iso. (two sheets)]. B. scabrifolia var. acuminata G.J.Lewis: 70 (1959). Type: South Afri ca, [Western Cape], Olifants River Valley at Bulshoek Barrage, 2 August 1950, G.J. Lewis 2207 (SAM, bolo.; SAM, iso.). B. scabrifolia var. dec/ina/a G.J.Lewis: 71 (1959). Type: South Africa, [Western Cape], between Brandewyn River and Doorn River, in sa nd, 25 August 1950, G.J.
Lewis 2192 (SAM, bolo. ). See Lewis ( 1959) for complete synonymy. #Plants 50- 150 mm high including leaves, with a thick collar of fibres around stem base, ± acaulescent, often forming tufts, simple or branched at ground level, hairy above ground level. Leaves lanceo1ate to oblong, exceeding stem, 60-100 x 6- 20 mm, usually inclined toward ground, soft-textured and lightly pleated, shorthairy or nearly smooth, narrow and twisted in young plants. Bracts 20- 32 mm long, green, sparsely hairy, the inner slightly shorter than the outer, divided to base (rarely shortly above base), with wide transparent margins. Flowers zygomorphic, 4-8 in a dense, inclined to horizontal spike, mostly pale blue to pale lilac, lower lateral tepals with broad white to creamcoloured splashes in upper hal f outlined in dark blue to violet, sweetly scented, often of narcissus; perianth tube narrowly funnel-shaped, 12- 18 mm long; tepa Is unequal, dorsal 30-45 x 7-10 mm, lower tepals 20-30 mm long, joined to upper laterals for up to 4 mm and to one another for ± 4 mm forming a prominent lip, margins of lower laterals undulate, sli ghtly crisped in lower half. Stamens 75 unilateral; filaments arched, 13- 18 rnm long; anthers 6-8 mm long; pollen white. Ovary thinly hairy above or on ribs, rare ly smooth; style dividing between middle and apex of anthers, branches 4-5 mm long, expanded at tips. Flowering time: #June to August. Plate 58. Distribution and ecology: #Western Cape: in the Olifants River Valley and lower slopes of the surrounding mountains; #stony soils in dry fynbos or karroid scrub (Map 36). # Hardly differing in its flowers from several other species of section Babiana, B. scabrifolia is most easily recogni zed by the underground stem, often branched at ground level, and usually forming small tufts. 
It reproduces amply by cormlets and it is common to see mature flowering plants with their broad, soft-textured leaves surrounded by the linear, twisted to coiled leaves produced by immature corms. The flowers are relatively large, and although ove1topped by the leaves, make an attractive sight in the Olifants River Valley and surrounding bills in early spring. The flowers have a pleasing sweet scent and contrast both in the scent and large size from those of B. mucronata, which is also common in the Olifants River Valley. That species typically has an erect, usually well-developed aerial stem, whereas the flowers have a densely hairy ovary and produce a rather harsh acrid-spicy odour (sometimes described as flea powder), and does not have young plants with narrow, twisted to coiled immature leaves surrounding the parent plant.
‘#next species #TYPE INFO#descriptions new sp 1 #flowering time sp. 1 #distribution sp 1 # #differs from sp. 2.
‘#next species2 #Type #description # # #habitat and no text for flowering time and distribution #differs from sp. 1

Box 1: Text copied from a PDF file, showing the workings for the first three instructions listed above. Deletions are in ‘strikethrough’. Circled in green are the headers of different types of additional information about the species. These need to be deleted too, since the text is assigned to the corresponding field in the scratchpad during mapping (step 5) at the end of the workflow. Words that need to be corrected because of incorrect OCR are highlighted in yellow. See the finished text in Box 2.

Box 2: Text from the PDF file above with all the editing complete. Note the corrections and the character replacements compared to Box 1.

Exercise: copy and paste the text in the box into a new Word document and save it as a plain text file encoded as Unicode (UTF-8).
Then import it into an Excel file (see instructions under step 4 above) and notice how the text has spread across different fields. Note the necessity for multiple delimiters if there is no text for some of the fields.

‘#Babiana scabrifolia #South Africa, [Western Cape], Clanwilliam District, Langevallei, July 1830, JF Drege 2623 [K, lecto., designated by Lewis, 1959: 68; P, iso. (two sheets)]. #Plants 50 – 150 mm high including leaves, with a thick collar of fibres around stem base, ± acaulescent, often forming tufts, simple or branched at ground level, hairy above ground level. Leaves lanceolate to oblong, exceeding stem, 60 – 100 × 6 – 20 mm, usually inclined toward ground, soft-textured and lightly pleated, short-hairy or nearly smooth, narrow and twisted in young plants. Bracts 20 – 32 mm long, green, sparsely hairy, the inner slightly shorter than the outer, divided to base (rarely shortly above base), with wide transparent margins. Flowers zygomorphic, 4 – 8 in a dense, inclined to horizontal spike, mostly pale blue to pale lilac, lower lateral tepals with broad white to cream-coloured splashes in upper half outlined in dark blue to violet, sweetly scented, often of narcissus; perianth tube narrowly funnel-shaped, 12 – 18 mm long; tepals unequal, dorsal 30 – 45 × 7 – 10 mm, lower tepals 20 – 30 mm long, joined to upper laterals for up to 4 mm and to one another for ± 4 mm forming a prominent lip, margins of lower laterals undulate, slightly crisped in lower half. Stamens unilateral; filaments arched, 13 – 18 mm long; anthers 6 – 8 mm long; pollen white. Ovary thinly hairy above or on ribs, rarely smooth; style dividing between middle and apex of anthers, branches 4 – 5 mm long, expanded at tips. #June to August. #Western Cape: in the Olifants River Valley and lower slopes of the surrounding mountains #stony soils in dry fynbos or karroid scrub. #Hardly differing in its flowers from several other species of section Babiana, B.
scabrifolia is most easily recognized by the underground stem, often branched at ground level, and usually forming small tufts. It reproduces amply by cormlets and it is common to see mature flowering plants with their broad, soft-textured leaves surrounded by the linear, twisted to coiled leaves produced by immature corms. The flowers are relatively large, and although overtopped by the leaves, make an attractive sight in the Olifants River Valley and surrounding hills in early spring. The flowers have a pleasing sweet scent and contrast both in the scent and large size from those of B. mucronata, which is also common in the Olifants River Valley. That species typically has an erect, usually well-developed aerial stem, whereas the flowers have a densely hairy ovary and produce a rather harsh acrid-spicy odour (sometimes described as flea powder), and does not have young plants with narrow, twisted to coiled immature leaves surrounding the parent plant.
‘#next species #TYPE INFO#descriptions new sp 1 #flowering time sp. 1 #distribution sp 1 # #differs from sp. 2.
‘#next species2 #Type #description # # #habitat and no text for flowering time and distribution #differs from sp. 1

Step 5: Excel sheet for mapping into TEMPLATE

The Excel sheet you get should look similar to this (add a row at the top to say which field the data belongs to):

You need to look through the data in the Excel sheet carefully and make sure that all the data is in the appropriate column. Failing to do so will mean that data ends up in the wrong field in the scratchpad.

Tips: sort the data in the different fields in alphabetical order. This may help to quickly spot any anomalies, especially if there is data missing that you would normally expect to be there.
If that is the case, look in all the adjacent fields to the left to see if you can find the data there; there may be a delimiter missing in the text file. Make any necessary corrections in the text file and save it. Then import the text file into Excel again. That will be the final Excel sheet to use for mapping.

Download a fresh copy of the taxon description TEMPLATE from your scratchpad:

Four things to notice:
1. the two sheets in the document, one called ‘Template’ and the other one called ‘Permitted Values’.
2. the number of fields (or headings) in the Template worksheet, most of which you don’t need.
3. the Taxonomic names field and the reference fields are set up to only allow the values listed in the Permitted Values worksheet. There are only permitted values for the references if there are fewer than 700 records in the bibliography. If there are
4. the reference fields as you scroll to the right in the Template sheet. Each field that can have content from the published literature needs to be referenced. There are two types of reference fields: (NID) and (Title). They correspond to a number called the node ID or to the title of the bibliographic record that references the content. You will need to enter a value in one of the reference fields for each field with content (see step e below).

Then work through the steps below:

a. Add your Excel document to the TEMPLATE as a new worksheet: click on ‘Insert Worksheet’ next to the Permitted Values sheet (this should add a sheet called Sheet1); copy the worksheet in your Excel document and paste it into the new sheet in the TEMPLATE.

b. Add a new column next to the species names. Call the new column ‘Names for import’ and populate it with the names in the existing Names column using the function =TRIM(text). Use Auto fill to copy the function to all cells in the column. This function gets rid of superfluous spaces before or after the name.
This is indispensable because a space at the start or the end of a taxon name would cause the scratchpad to identify it as a new addition to the classification (see screenshot below, where A2 returns the name in B2).

c. In Sheet1, select and copy all the names in the ‘Names for import’ column you made earlier. Switch to the Template sheet and right click into the field B2 (just below the field entitled Taxonomic Name). Select Paste Special. This opens up a menu. Select ‘Values’ under Paste and click OK. This pastes the names of the taxa rather than the =TRIM(text) function. The result should look like this:

d. Next, map the remaining fields from Sheet1 to the Template worksheet using the =TRIM(text) function. Compare the next two screenshots: in P2 of the Template worksheet, the formula refers back to D2 in Sheet1 and returns the text without any superfluous spaces. The reason for this has to do with adding the references for records that have missing data.

e. Now you can add the values in one of the reference fields in the Template worksheet. [If you don’t already have a bibliographic record for the publication from which you got the species description, you need to create one in the scratchpad.] Look up the Title or the node ID of the bibliographic record in the scratchpad, or start typing the Title of the reference in the Title reference field and it should look up the reference in the Permitted Values drop-down. I will use the Title in the example below.

[Screenshot: the NID and Title of the bibliographic record]

Only the fields that contain data must be referenced, so empty fields should not have a reference. In our experience, the General description field is the field for which you are most likely to have content for all the taxa in the worksheet, and all the other content most likely comes from the same source.
If this is the case, we suggest referencing the General description field first. Copy and paste the Title from the bibliographic record into the General description reference (Title) field*. Then apply the =IF([logical_test], [value_if_true], [value_if_false]) function to populate the reference fields for the other content*. This avoids adding a reference for an empty content field. In our example, our function for the Distribution Reference (Title) field would be: =IF(M2="","",BS2) where M2 is the Distribution field and BS2 is the Reference Title field for General Description. The formula reads: if the Distribution field M2 is empty, leave this field empty. If it is not empty, fill it with what’s in the General description Reference (Title) field, BS2.

*If you get an error message like this, it’s because the reference you’ve tried to use is not listed in the Permitted Values worksheet or because the validation rules don’t allow functions. I suggest you change the validation rules as detailed below. Highlight all the reference columns, select ‘Data Validation’ under the Data tab, say yes to ‘Erase current settings and continue?’, in the next window select Clear All and press OK. Now try again.

f. Now you need to replace any formulas in the worksheet with the actual text. The file that you save and use in the import must only have the two worksheets ‘Template’ and ‘Permitted Values’. Download a new TEMPLATE again (in particular if you’ve had to clear all data validation settings), select all the records in the first Template worksheet, copy, and paste into the new Template worksheet using the Paste Special: Values option. Save the file. This file is now ready for import.

Remember to get in touch with the Content team or the Scratchpad team if you get stuck.

3. Adding individual taxon descriptions

There are several different ways of copying and pasting taxon descriptions and associated information into the scratchpads. You may need to play around and decide for yourself which one works best for you, depending on whether you want to retain formatting in the text (e.g. Paste from Word). Within the text you copy over, you need to delete references to figures and maps.

Each text field has a corresponding reference field, usually right below the text field. The reference fields are autocompleting (like the Taxonomic Name field), but sometimes it’s difficult to find the reference from the drop-down list. If that’s the case, search for the bibliographic record in a new tab and get the NID (see screenshot under Worked example above, step e). Paste the NID in the following format: [nid:###] (i.e. [nid:189] in the example above).

In the top bar, select the Content tab. In the window that opens, find Taxon Description and select Add on the right hand side. In the Create Taxon Description window (screenshot below), the fields are organised in different vertical tabs. The Overview tab includes the Taxonomic Name field and the General Description field. Type out part of the name of the taxon in the Taxonomic Name field. A drop-down list should appear with name options available from the Classification. Select the name you want. Paste the species description directly into the General description field.

Screenshot annotations: Start typing the name and wait for the auto-complete. Choose the name from the auto-complete list. Vertical tabs.

Method 1: Paste text straight in the box. Preview and then Save. Notice the formatting of the text in the next screenshot. Add the reference for the text.

The text has lost all formatting. To preserve formatting, see Method 2 in the next screenshot.
Method 2: In the Create Taxon Description window, in the field, click on the ‘Paste from Word’ icon. Paste the text in the window that opens. Notice the formatting in the text. Delete any text you don’t need. Preview.

This is the preview of the preceding screenshot. Notice the formatting in the text. Now Save.

The formatting has been preserved.

Method 3: If your text comes from an older PDF, with paragraph breaks at the end of each line, use this method: switch to the plain text editor (all the icons at the top of the box disappear) and paste your text in the box. Preview (see next screenshot). This method does not retain formatting in the text.

In the preview stage, the original paragraph marks show, but they are no longer present in the actual text box. Save. The paragraph marks will not be retained.

Add paragraphs:
- in the rich text editor: press enter as you would in Word.
- in the plain text editor: put a <br> where the paragraph needs to break.

There is no formatting in the text. You can use html tags in the text you paste into the plain text editor.

Remember to get in touch with the Content team or the Scratchpad team if you get stuck.
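
The bulk workflow in section 2 is carried out in Word and Excel, but the delimiter logic behind it can also be sketched in code. The Python below is only an illustration under stated assumptions: the function names (excel_trim, split_records, reference_for) are hypothetical and not part of any Scratchpads tool, and the sample text is the placeholder example from Box 1. It splits text on the taxon delimiter ‘# and the field delimiter #, trims stray spaces roughly the way =TRIM(text) does, and fills a reference only for non-empty fields, mirroring =IF(M2="","",BS2).

```python
TAXON_DELIM = "\u2018#"  # ‘# marks the start of each taxon record
FIELD_DELIM = "#"        # # separates the fields within a record


def excel_trim(value):
    """Approximate Excel's TRIM: drop leading/trailing spaces and
    collapse runs of whitespace between words to a single space."""
    return " ".join(value.split())


def split_records(text):
    """Split delimited text into one list of trimmed fields per taxon."""
    records = []
    for chunk in text.split(TAXON_DELIM):
        if not chunk.strip():
            continue  # skip anything empty before the first taxon delimiter
        records.append([excel_trim(field) for field in chunk.split(FIELD_DELIM)])
    return records


def reference_for(content, reference_title):
    """Mirror =IF(M2="","",BS2): reference only fields that hold content."""
    return reference_title if content else ""


# Placeholder records taken from the worked example (Box 1).
sample = (
    "\u2018#next species #TYPE INFO#descriptions new sp 1 #flowering time sp. 1 "
    "#distribution sp 1 # #differs from sp. 2. "
    "\u2018#next species2 #Type #description # # "
    "#habitat and no text for flowering time and distribution #differs from sp. 1"
)

records = split_records(sample)
for fields in records:
    # Each record should have the same number of fields.
    print(len(fields), fields[0])
```

Both sample records come out with seven fields. The empty strings hold the place of the missing habitat, flowering time and distribution text, which is why a delimiter is still needed for every empty field.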