    About the Organisation and Repository Identification (ORI)

    Ian Stuart / April 1, 2015
    The ORI service provides a standalone middleware tool for identifying academic organisations and their institutional repositories. The service is provided by EDINA, who run it as a micro-service.

    ORI provides an API giving access to information on circa 25,000 academic organisations, some 14,000 networks that map to those organisations, and some 3,000 repositories that also relate to those organisations.

    An important characteristic of the ORI service is that it is a database of organisations which may have a list of repositories as an attribute, as opposed to a database of repositories... which have organisations as one of their attributes.

    It is an edited union of multiple authoritative and extant data sources, with queries returning:

    • Names and URLs for organisations (and repositories), including date when URLs held in the database were last checked, and whether they were "alive"
    • Geolocations for organisations (and repositories)
    • Network ranges for organisations
    • Descriptive details for repositories (descriptions, OAI base URLs, SWORD service documents etc)

    The results for repositories can be restricted by repository type(s) and/or accepted content type(s). Results are returned in a number of formats: JSON (default), XML, and simple text.

    A SPARQL query interface to the data is available, and a full dataset can be downloaded as RDF/XML or Turtle Linked Data.

    ORI Architecture

    The ORI service is built on a range of open source software including Apache with mod_perl, PostgreSQL, and Perl. Data on organisations is gathered regularly by the ORI service from a number of sources using custom CGI scripts written in Perl.

    Several documented APIs are provided that support querying of the ORI database by remote applications and the return of data in the requested format: JSON, XML or text. JSON is the default if no format is requested. ORI also provides Linked Data.

    Functional Diagram

    [Figure: ORI functional diagram]

    Infrastructure

    Details of the infrastructure used for the ORI development and test environment(s) are held by the project software engineer(s). If you have further questions about the service that are not answered by this guide, please contact the EDINA help desk team, or send email direct to edina@ed.ac.uk.

    To provide resilience and scale for load, the micro-service runs across two installations of the ORI system, one at each of EDINA's two data centres, with a load balancer service distributing client traffic across these sites.

    The Dataset

    This section documents the APIs by which clients can query the ORI dataset to retrieve data.

    Extent

    The ORI dataset is a list of [academic] organisations, with details of networks and repositories associated with them.

    The set presently contains data on 25,000 organisations, 3,000 repositories, and 14,000 networks. There are 30,000 URLs and 54,000 names for these objects, and the set is still growing.

    Data Returns

    The following data may be returned when the dataset is queried via calls to the APIs.

    Organisation Data

    org_id The ID for the org (can be used in other API calls)
    lat The Latitude held for the organisation
    long The Longitude held for the organisation
    city The city, or physical location, for the organisation
    countrycode The two-letter country-code the organisation is located in (ISO 3166-1 codes)
    identities A list of names (and URLs) for the organisation (see below for details)

    Data is also pulled in from the identities data. The following are taken from the first identity record:

    org_name
    org_npri
    org_acronym
    org_npref
    org_iri

    From the first matching (else non-matching) URL for the first identity:

    org_url
    org_upri
    org_checked_good
    org_date_checked

    Repository Data

    repo_id The ID for the repository: can be used in other API calls.
    lat The Latitude for the repository.
    long The Longitude for the repository.
    postaddress The address the repository is located at.
    countrycode The two-letter country-code the organisation is located in (ISO 3166-1 codes)
    oaibaseurl The base URL for OAI harvesting.
    softwarename What software the repository uses e.g. EPrints, DSpace, flubber, etc.
    softwareversion The version of the repository software.
    description The main description of the repository.
    comment A list of additional comments for the repository.
    types A list of the repository's types: institutional, data, etc.
    content A list of the content types the repository accepts e.g. Pre-prints, data, etc.
    external_ids A list of external IDs e.g. OpenDOAR_123, etc.
    language A list of languages used in the repository interface.
    sword A list of service document locations for the repository.
    identities A list of names and URLs for the organisation. See below for details.

    The following are taken from the first identity record.

    repo_name
    repo_npri
    repo_acronym
    repo_npref
    repo_iri

    From the first matching (else non-matching) URL for the first identity.

    repo_url
    repo_upri
    repo_checked_good
    repo_date_checked

    Network Data

    net_id The ID for the network: can be used in other API calls.
    inetnum The IP range for the network (123.234.0.0-123.234.63.255).
    dec_lower The first IP number of the range (123.234.0.0, from above).
    dec_upper The last IP number of the range (123.234.63.255, from above).
    identities A list of name(s) for the network: there are no URLs. See below for details.

    The following are taken from the first identity record.

    net_name
    net_npri
    net_acronym
    net_npref
    net_iri

    Identities

    Each entry in the array is a name for the object, with whichever name is defined as "Primary" at the start of the list.

    Each identity object contains the following keys (if they exist in the database):

    name The name of the object ("Poppleton University", "Plink-Plonk Repository", etc.)
    acronym Any acronym the object may be known as ("PU", "PPR", etc.)
    npref A true/false flag that indicates which is the preferred term. (Absent means true, not false, or "There is no statement that the name is not the preferred term".)
    pri A true/false flag that indicates if the name is marked as Primary. Again, this flag is not always defined, as there may be only one option, or there may be no definite name that is the primary name.
    iri The Open Linked-Data URI to get the linked-data record
    nid The database ID for the name
    urls A sub-element containing URL data for the object, as associated with the particular name.
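The rules above for choosing a name can be sketched in a few lines of Python. This is a minimal illustration, not part of the ORI service: the helper names `is_preferred` and `display_name` are hypothetical, and it assumes the true/false flags decode to values that are truthy or falsy in Python (the exact encoding is not stated in this guide).

```python
def is_preferred(identity):
    # An absent npref flag means "preferred" (see the npref description).
    # Assumption: the flag decodes to a Python truthy/falsy value.
    return bool(identity.get("npref", True))


def display_name(identities):
    """Return the best name from a list of identity dicts, or None if empty."""
    named = [i for i in identities if "name" in i]
    preferred = [i for i in named if is_preferred(i)]
    # The Primary name, if any, is already first in the list, so order is kept.
    candidates = preferred or named
    return candidates[0]["name"] if candidates else None


sample = [
    {"name": "Poppleton University", "acronym": "PU", "pri": True},
    {"name": "Poppleton Uni", "npref": False},
]
print(display_name(sample))  # Poppleton University
```

Falling back to non-preferred names means a caller always gets some label when at least one name exists.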

    URLs

    In the database, there is an association between names and URLs. This is to enable objects to have multi-lingual names, and appropriate URLs for each language (e.g. Ukrainian, Russian, and English).

    The urls element contains two keys: “matching” and “non-matching”, both of which are lists of url objects:

    'urls' => {
                'matching'     => [
                                    {....},
                                    {....}
                                  ],
                'non_matching' => [
                                    {....},
                                    {....}
                                  ]
              }
    

    If a URL is flagged as Primary, it is placed at the front of the appropriate list.

    Within each url object, the following data is returned:

    url The actual URL.
    pri Whether the URL is marked as a primary one.
    live A true/false flag to indicate if the URL returns [a non-error] web page.
    date The date that the URL was last checked. Note that no history is kept of the alive/not-alive checking. Hosts that are alive are re-checked weekly, hosts that are not flagged as alive are checked on a daily basis.
    uid The database ID for the URL.
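As an illustration of how a client might use these fields, the sketch below picks one URL from an identity's urls element. The function name `best_url` is hypothetical, the key names `matching` and `non_matching` are taken from the sample structure, and the live flag is assumed to decode to a truthy or falsy value.

```python
def best_url(urls):
    """Pick one URL from an identity's 'urls' element.

    Returns the first live URL, looking in 'matching' before 'non_matching';
    if nothing is flagged live, returns the first URL found in that same order.
    """
    order = ("matching", "non_matching")
    for key in order:
        for entry in urls.get(key, []):
            if entry.get("live"):
                return entry["url"]
    # Nothing is flagged live: fall back to the first URL of any kind.
    for key in order:
        if urls.get(key):
            return urls[key][0]["url"]
    return None


example = {
    "matching": [{"url": "http://a.example", "live": False}],
    "non_matching": [{"url": "http://b.example", "live": True}],
}
print(best_url(example))  # http://b.example
```

Because Primary URLs are placed at the front of each list, taking the first suitable entry also favours the Primary URL.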

    main API

    The primary contact point for calls to the ORI is http://ori.edina.ac.uk/api.

    Data Returns

    All APIs return data in the same ways:

    1. You can specify the format either with the Accept header in the HTTP request, or with the format parameter. The format options are ‘json’, ‘xml’, or ‘text’, with ‘json’ being the default if nothing is specified.
      If there is a callback parameter, and the format is ‘json’, then a crossDomain package is returned.
    2. All APIs return the data as a nested object, with three top-level elements:
           {
             'message' => {},
             'status'  => 'ok',
             'to'      => 'http://.....'
           }

      status is “ok” or “fail”, to is the URL that made the query, and message contains the actual data being returned.
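A client will usually want to check the status element before touching the data. The following is a minimal sketch for JSON responses; `OriError` and `unwrap` are hypothetical names, and the sample body is hand-written for illustration rather than captured from the service.

```python
import json


class OriError(Exception):
    """Raised when a response reports a status other than 'ok'."""


def unwrap(body):
    """Parse a JSON response body and return its 'message' element."""
    envelope = json.loads(body)
    if envelope.get("status") != "ok":
        raise OriError(
            f"query {envelope.get('to')!r} returned status {envelope.get('status')!r}"
        )
    return envelope.get("message", {})


body = '{"message": {"net": {}}, "status": "ok", "to": "http://ori.edina.ac.uk/api?org=2736"}'
print(unwrap(body))  # {'net': {}}
```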

    Making Calls

    Calls are currently made to http://ori.edina.ac.uk/api? and the “locus” for the search can be defined in a number of ways:

    1. You can specify an IP number to base the search on (ip=129.215)
      • If a full quad is not given, then the full range based on what is given is assumed (so 129.215 means 129.215.0.0 to 129.215.255.255)
      • If a range is defined (ie 129.214-129.217) then the upper and lower bands are set accordingly (ie 129.214.0.0-129.217.255.255)
    2. You can specify a geographic location to base the search on (geo=55.95,-3)
      • The accuracy for the search depends on the numbers given: the range is always +/- 1 either side of the last decimal place given (so a bounding box of 55.94,-2.9 to 55.96,-3.1)
    3. You can specifically define an organisation ID to fix your search on (org=2736)
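The expansion rules for IP and geographic loci can be sketched as follows. This is an illustration of the rules as described, not the service's own code, and both function names are hypothetical. The geographic helper derives its step from the digits actually written, so "-3.0" gives -3.1 to -2.9 while a bare "-3" would give -4 to -2; how the service itself treats an integer coordinate is not confirmed here.

```python
from decimal import Decimal


def expand_ip_locus(value):
    """Expand a partial IP or IP range into (lower, upper) dotted quads."""
    low, _, high = value.partition("-")
    high = high or low  # a single value is both ends of the range

    def pad(prefix, filler):
        parts = prefix.split(".")
        return ".".join(parts + [filler] * (4 - len(parts)))

    return pad(low, "0"), pad(high, "255")


def geo_bounds(coordinate):
    """Return (min, max), one unit either side of the last decimal place given."""
    value = Decimal(coordinate)
    step = Decimal(1).scaleb(value.as_tuple().exponent)  # "55.95" -> 0.01
    return value - step, value + step


print(expand_ip_locus("129.215"))          # ('129.215.0.0', '129.215.255.255')
print(expand_ip_locus("129.214-129.217"))  # ('129.214.0.0', '129.217.255.255')
print(geo_bounds("55.95"))                 # (Decimal('55.94'), Decimal('55.96'))
```

Using Decimal rather than float keeps the number of decimal places the caller typed, which is what determines the size of the bounding box.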

    You can specify multiple locus points; how they interact depends on their types:

    • Every locus definition within the same type is cumulative: if you specify two IP ranges, then anything in either range is listed.
      • This can lead to lots and lots of results
    • Every locus definition that combines different types results in an intersection of the results (all the results on a specified network range that are also within a specified geographic location)
      • This can lead to zero results

    In addition to defining the locus for the search, the repositories returned can be tuned to return only those of a certain type, and/or only those that accept particular types of deposits.

    • type is the parameter that defines the type of repository (Institutional, Data, etc), and it's the code number you need (see the appropriate list/type call for the known list of types).
    • content is the parameter that defines the type of content the repository accepts (pre-prints, data, learning objects, etc), and it's the code number you need (see the appropriate list/content call for the known list of content-types).
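Putting the locus and tuning parameters together, a query URL might be assembled as in the sketch below. The function `build_query` and its argument names are hypothetical. It also assumes that several loci of the same type are passed by repeating the parameter, which this guide does not spell out.

```python
from urllib.parse import urlencode

API_BASE = "http://ori.edina.ac.uk/api"


def build_query(ip=(), geo=(), org=(), repo_types=(), content_types=(), fmt=None):
    """Build a main-API URL from locus values and optional filters."""
    params = []
    for name, values in (
        ("ip", ip),
        ("geo", geo),
        ("org", org),
        ("type", repo_types),        # code numbers from list/type
        ("content", content_types),  # code numbers from list/content
    ):
        # Assumption: a repeated parameter adds another locus of that type.
        params.extend((name, str(value)) for value in values)
    if fmt:
        params.append(("format", fmt))
    # Keep the comma in geo values readable rather than percent-encoded.
    return f"{API_BASE}?{urlencode(params, safe=',')}"


print(build_query(ip=["129.215"], repo_types=[7], fmt="xml"))
# http://ori.edina.ac.uk/api?ip=129.215&type=7&format=xml
```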

    Returned Data Object

    The data object returned is a set of net objects (indexed by net_id), within which is a list of org objects associated with that network. Within each org object is a list of repo objects. All objects conform to the specification here. The data is not sorted before being returned.

     {
       'message' => {net} => 'i38647' => { 'dec_lower' => '152.78.0.0',
                                           'dec_upper' => '152.78.255.255',
                                           'orgs' => [ { 'org_name' => 'AgentLink.org',
                                                         'org_url' => 'http://www.agentlink.org',
                                                         'repos' => [ { 'repo_name' => 'xxxxxxx',
                                                                        'repo_url'  => 'yyyyyy',
                                                                        ................
                                                                       },
                                                                       {
                                                                        .................
                                                                       } ]
                                                         },
                                                         {
                                                     } ],
                                           ...........
                                         }
                          => 'i39677' => {
                                           .............
                                         }
     }
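Walking the nested structure is straightforward once the response has been decoded. The sketch below assumes the network objects sit under a `net` key inside message, as the outline above suggests, with `orgs` and `repos` as the list keys; `iter_repos` is a hypothetical helper and the sample data is hand-written.

```python
def iter_repos(message):
    """Yield (net_id, org, repo) for every repository in a main-API message."""
    for net_id, net in message.get("net", {}).items():
        for org in net.get("orgs", []):
            for repo in org.get("repos", []):
                yield net_id, org, repo


message = {
    "net": {
        "i38647": {
            "dec_lower": "152.78.0.0",
            "dec_upper": "152.78.255.255",
            "orgs": [
                {
                    "org_name": "AgentLink.org",
                    "repos": [{"repo_name": "xxxxxxx"}, {"repo_name": "zzzzzzz"}],
                },
                {"org_name": "An organisation with no repositories"},
            ],
        }
    }
}

for net_id, org, repo in iter_repos(message):
    print(net_id, org["org_name"], repo["repo_name"])
```

Since the data is not sorted before being returned, a caller that needs a stable order should sort the results itself.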
    

    get_xxx API

    This suite of functions was initially created as part of a set of “data sanity checking” web pages, and has now been brought in line with the other functions and made generic.

    Data returns

    All APIs return data in the same ways:

    1. You can specify the format either with the Accept header in the HTTP request, or with the format parameter. The options are json, xml, or text, with json being the default if nothing is specified. The get_xxx suite also understands the prototype format (see below).
      If there is a callback parameter, and the format is json, then a crossDomain package is returned.
    2. For prototype returns, the data is formatted as an XHTML unordered list (as per the script.aculo.us/Prototype requirements), with the for attribute set to match EPrints field names.
    3. For all other returns, the data is a list of data records.

    Making Calls

    This suite of three APIs is at http://ori.edina.ac.uk/cgi-bin/get_xxx, and is there to support AJAX calls.

    The basic premise is that the term to be looked up is passed in a parameter q, and all the records that have that term somewhere in the data are returned.

    Additional parameters can be used to tune the query:

    • format - define the format being returned
    • field - specify which field to query on (see the individual functions for more details on this)

    The three queries are:

    get_orgs

    This query will search either the name or the URL to return a list of organisations that match. By default, the name field is searched.

    get_nets

    This query will search either the name or an IP number to return a list of networks that match. By default, the name field is searched, however if the script spots an IP number, it will automatically switch to an IP search.

    get_repos

    This query will search either by name or URL to return a list of repositories that match. By default, the name field is searched.
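A lookup URL for any of the three queries could be built along these lines. `get_url` is a hypothetical helper, and the accepted values for the field parameter are not listed in this guide, so the "url" value in the usage example is an assumption.

```python
from urllib.parse import urlencode

GET_BASE = "http://ori.edina.ac.uk/cgi-bin"


def get_url(kind, term, field=None, fmt=None):
    """Build a get_orgs, get_nets or get_repos URL for the search term."""
    if kind not in ("orgs", "nets", "repos"):
        raise ValueError(f"unknown get_xxx query: {kind!r}")
    params = {"q": term}
    if field:
        params["field"] = field  # assumption: field names such as "url"
    if fmt:
        params["format"] = fmt
    return f"{GET_BASE}/get_{kind}?{urlencode(params)}"


print(get_url("orgs", "Edinburgh"))
# http://ori.edina.ac.uk/cgi-bin/get_orgs?q=Edinburgh
print(get_url("repos", "eprints", field="url", fmt="json"))
# http://ori.edina.ac.uk/cgi-bin/get_repos?q=eprints&field=url&format=json
```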

    list/xxx API

    These APIs return lists of values, some of which may be used as parameters for the main API calls.

    Data returns

    All APIs return data in the same ways:

    1. You can specify the format either with the Accept header in the HTTP request, or with the format parameter. The options are ‘json’, ‘xml’, or ‘text’, with ‘json’ being the default if nothing is specified.
      • If there’s a callback parameter, and the format is json, then a crossDomain package is returned: very useful!
    2. All return the data as a nested object, with three top-level elements:
         {
           'message' => {},
           'status'  => 'ok',
           'to'      => 'http://.....'
         }

    status is ok or fail, to is the URL that made the query, and message contains the actual data being returned, which is dependent upon the query!

    Making Calls

    This is a suite of APIs at http://ori.edina.ac.uk/cgi-bin/list/xxx that pull out lists on the following:

    • type
    • content
    • country
    • lang
    • org
    • net
    • repo
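A request for one of these lists can be prepared as in the sketch below, which builds the request object without sending it. `list_request` is a hypothetical helper, and the MIME type passed in the Accept header is an assumption, since this guide does not state which header values the service recognises.

```python
from urllib.parse import urlencode
from urllib.request import Request

LIST_BASE = "http://ori.edina.ac.uk/cgi-bin/list"
KINDS = ("type", "content", "country", "lang", "org", "net", "repo")


def list_request(kind, fmt=None, accept=None):
    """Prepare (but do not send) a request for one of the list/xxx APIs."""
    if kind not in KINDS:
        raise ValueError(f"unknown list: {kind!r}")
    url = f"{LIST_BASE}/{kind}"
    if fmt:
        url += "?" + urlencode({"format": fmt})  # json, xml or text
    headers = {"Accept": accept} if accept else {}
    return Request(url, headers=headers)


req = list_request("lang", accept="application/xml")
print(req.full_url)  # http://ori.edina.ac.uk/cgi-bin/list/lang
```

The prepared request can then be passed to `urllib.request.urlopen`; bear in mind that the org, net and repo lists take minutes to return.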

    type

    This lists the type (or classification) of repository. The classification scheme is automatically extended as new types are listed in the up-stream sources.

    To use a repository type with the main API use the code number required e.g. ?type=11

    The count element indicates how many repositories are in the set.

    'message' => {
                   'type' => [
                               {
                                 'code'  => 1,
                                 'count' => 57,
                                 'text'  => 'Subject (Research Cross-Institutional)'
                               },
                               {
                                 'code'  => 2,
                                 'count' => 299,
                                 'text'  => 'Other'
                               },
                               ......
                             ]
                          },
    

    The classification scheme is automatically extended as new types are listed in the up-stream sources, but started as:

    Type Code Descriptive Text
    1 Subject (Research Cross-Institutional)
    2 Other
    3 Disciplinary (Cross-institutional subject repositories)
    4 Journal (e-Journal/Publication)
    5 Database (Database/A&I Index)
    6 Demonstration
    7 Institutional (Institutional or departmental repositories)
    8 Thesis
    9 Undetermined - Repositories whose type has not yet been assessed
    10 Aggregating (Archives aggregating data from several subsidiary repositories)
    11 Learning (Learning and Teaching Objects)
    12 Governmental (Repositories for governmental data)
    13 Theses
    14 Multi
    15 Researchdata
    16 Opendata

    Adding the parameter full=1 will cause the query to return all the repositories that are of that type listed under a repos element. Note that repositories are not exclusively one type or another, and may appear under multiple types.

    The repos sub-elements are indexed by repo_id.
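Because the main API wants the code number rather than the text, a client may find it handy to look codes up from the list/type response. The helpers below are hypothetical and operate on an already decoded message element shaped like the sample above.

```python
def index_types(message):
    """Map each repository type code to its descriptive text."""
    return {entry["code"]: entry["text"] for entry in message.get("type", [])}


def code_for(message, text):
    """Return the code whose text starts with `text` (case-insensitive), or None."""
    wanted = text.lower()
    for entry in message.get("type", []):
        if entry["text"].lower().startswith(wanted):
            return entry["code"]
    return None


types_message = {
    "type": [
        {"code": 1, "count": 57, "text": "Subject (Research Cross-Institutional)"},
        {"code": 2, "count": 299, "text": "Other"},
    ]
}
print(code_for(types_message, "other"))  # 2
```

Looking codes up at run time, rather than hard-coding them, allows for the scheme being extended as new types appear upstream.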

    content

    This lists the type of content that repositories accept. The classification scheme is automatically extended as new types are listed in the up-stream sources.

    To use a content type with the main API use the code number required e.g. ?content=11

    The count element indicates how many repositories are in the set.

      <message>
        <content>
          <code>1</code>
          <count>112</count>
          <text>Research papers (pre- and postprints)</text>
        </content>
        <content>
          <code>2</code>
          <count>86</count>
          <text>Research papers (preprints only)</text>
        </content>
        .....
      </message>
    

    The classification scheme is automatically extended as new types are listed in the up-stream sources, but started as:

    Content Code Descriptive Text
    1 Research papers (pre- and postprints)
    2 Research papers (preprints only)
    3 Research papers (postprints only)
    4 Bibliographic references
    5 Conference and workshop papers
    6 Theses and dissertations
    7 Unpublished reports and working papers
    8 Books & chapters and sections
    9 Datasets
    10 Learning Objects
    11 Multimedia and audio-visual materials
    12 Software
    13 Patents
    14 Other special item types

    Adding the parameter full=1 will cause the query to return all the repositories that accept that content type, listed under a repos element. Note that repositories are not restricted to one content type, and may appear under multiple content types.

    The repos sub-elements are indexed by repo_id.
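For clients that request XML, the content list can be read with the standard library. This sketch assumes the content elements look like the fragment above; it searches the whole tree, so it should not matter which element wraps them. `parse_content_list` is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def parse_content_list(xml_text):
    """Return a list of {code, count, text} dicts from a list/content XML reply."""
    root = ET.fromstring(xml_text)
    return [
        {
            "code": int(node.findtext("code")),
            "count": int(node.findtext("count")),
            "text": node.findtext("text"),
        }
        for node in root.iter("content")  # finds <content> at any depth
    ]


xml_sample = """
<message>
  <content>
    <code>1</code>
    <count>112</count>
    <text>Research papers (pre- and postprints)</text>
  </content>
  <content>
    <code>2</code>
    <count>86</count>
    <text>Research papers (preprints only)</text>
  </content>
</message>
"""
for item in parse_content_list(xml_sample):
    print(item["code"], item["text"])
```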

    lang

    This call lists all the languages the dataset knows about: the ISO 639 codes.

    (We are limited to ISO 639-2 as ISO 639-3 and later are not Open Access lists and there is a clause which states “the product, system, or device does not provide a means to redistribute the code set.”)

    The count element indicates how many repositories are in the set.

    {
      "to" : "http://ori.edina.ac.uk/cgi-bin/list/lang",
      "status" : "ok",
      "message" : {
        "lang" : [
          {
            "text" : "Abkhazian",
            "iso3_b" : "abk",
            "count" : 0,
            "code" : "ab"
          },
          {
            "text" : "Achinese",
            "iso3_b" : "ace",
            "count" : 0
          },
        ]
      }
    }
    

    Adding the parameter full=1 will cause the query to return all the repositories that use that language, listed under a repos element. Note that repositories are not restricted to one language, and may appear under multiple languages.

    The repos sub-elements are indexed by repo_id.
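Many languages in the list have a count of zero, so a client may want to keep only those in use. The sketch below keys the result on iso3_b because the two-letter code is not present on every entry (Achinese above has none); that iso3_b is always present is an assumption based on the sample. `languages_in_use` is a hypothetical helper, and the third entry and its count are made up for illustration.

```python
def languages_in_use(message):
    """Return {iso3_b: text} for languages with at least one repository."""
    return {
        entry["iso3_b"]: entry["text"]
        for entry in message.get("lang", [])
        if entry.get("count", 0) > 0
    }


lang_message = {
    "lang": [
        {"text": "Abkhazian", "iso3_b": "abk", "count": 0, "code": "ab"},
        {"text": "Achinese", "iso3_b": "ace", "count": 0},
        # Illustrative entry: the count here is invented for the example.
        {"text": "English", "iso3_b": "eng", "count": 5, "code": "en"},
    ]
}
print(languages_in_use(lang_message))  # {'eng': 'English'}
```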

    country

    This lists all the countries the dataset knows about: the ISO 3166-1 codes.

    The count element indicates how many repositories are in the set.

    {
      "to" : "http://ori.edina.ac.uk/cgi-bin/list/country",
      "status" : "ok",
      "message" : {
        "country" : [
          {
            "text" : "Andora",
            "count" : 0,
            "code" : "ad"
          },
          {
            "text" : "United Arab Emirates",
            "count" : 0,
            "code" : "ae"
          },
        ]
      }
    }
    

    Adding the parameter full=1 will cause the query to include all the repositories, under a repos element, that are listed as being of that country.

    The repos sub-elements are indexed by repo_id.

    org

    This lists all the organisations in the dataset. This script will take over 15 minutes to complete as there is a large amount of data to return.

    Adding the parameter full=1 will cause the query to return the repositories for each organisation, listed under a repos element. Running the query with the full flag can take twenty minutes!

    The repos sub-elements are, in this situation, listed as described here.

    {
      "to" : "http://ori.edina.ac.uk/cgi-bin/list/org",
      "status" : "ok",
      "message" : {
        "org" : {
          "1" : {
            <as per org listing>
          },
          "4": {
            <as per org listing>
          }
        }
      }
    }
    
    net

    This call lists all the network IP range(s) for the organisations in the ORI where these are known. This script will take several minutes to complete as there is a large amount of data to return.

    Adding the parameter full=1 will cause the query to return the details of each organisation that is within each network.

    Results are returned in ascending order of the net_id.

    repo

    This call lists all the repositories in the dataset. This script will take several minutes to complete as there is a large amount of data to return.

    There is no full=1 flag.

    Results are returned in ascending order of the repo_id.

    Linked Data Files

    Up-to-date raw linked data files are produced each day and can be retrieved in W3C supported formats at: http://ori.edina.ac.uk/reference/linked/1.0/

    • ori.ttl is the Turtle format file
    • ori_rdf.xml is the RDF/XML format file.

    Further Reading

    Open Archives Initiative

    The Open Archives Initiative develops and promotes interoperability standards that aim to facilitate the efficient dissemination of content. OAI has its roots in the open access and institutional repository movements.

    Glossary of Terms

    Acknowledgement

    ORI was developed by EDINA, with funding from JISC.