    Content Analysis with Python: 1

    Dave Elsmore / March 23, 2015
  • Keywords in Context

    Very often we want to see a keyword in context (KWIC), i.e. the word together with a number of words on either side.
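
    As a rough illustration of the idea, the sketch below takes a made-up sentence and slices out two words on either side of a keyword. The sentence, the variable names and the window size are all assumptions chosen for this example; they are not part of the kwic.py script itself.

```python
# Illustrative only: a made-up sentence and a window of two words.
tokens = "the quick brown fox jumps over the lazy dog".split()
window = 2

index = tokens.index("fox")                    # position of the keyword
left = tokens[max(0, index - window):index]    # up to two words before it
right = tokens[index + 1:index + 1 + window]   # up to two words after it

context = "%s [%s] %s" % (" ".join(left), tokens[index], " ".join(right))
print(context)  # quick brown [fox] jumps over
```

    The max(0, ...) guard keeps the left-hand slice from wrapping round to the end of the list when the keyword sits near the start of the text.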

    The following text is taken from the EDINA Community Report 2014.

    This Community Report sets out what EDINA does. EDINA develops and delivers world-class online services and expertise that benefit research and education in the UK and beyond. Our work is a significant part of the contribution that Jisc makes as a champion for the use of digital technologies in Higher and Further Education and in the skills sector. We seek ways to assist Jisc member organisations to succeed more effectively in their mission to improve outcome and increase impact within limited budgets. For researchers, students and their teachers this means enhancing their productivity with services that both inspire and save time, helping to make the imagined possible!

    What you find written here complements other summaries of our activity and the services we deliver, as found on the Jisc website, on the EDINA website, and in the EDINA Annual Review, which forms part of our formal accountability to Jisc and its stakeholders. The uptake and use of our services continues to grow, as it has done consistently since 1995/96 when EDINA first began its part, leveraging value from the University of Edinburgh in which we are based for the wider UK academic community.

    With the kwic.py script we will be able to type:

    python kwic.py edina.txt services 3

    and get back every instance of the word ‘services’ in the text file, each with three words of context on either side.
    When it’s working, the output is printed to the console and should look like this:

    delivers world-class online [services] and expertise that   
    their productivity with [services] that both inspire        
    activity and the [services] we deliver, as                  
    use of our [services] continues to grow,

    The script is reproduced below.

    import re
    import sys

    # command line arguments: text file, keyword, words of context per side
    path = sys.argv[1]
    target = sys.argv[2]
    window = int(sys.argv[3])

    # read the whole file; the with-block closes it afterwards
    with open(path) as f:
        text = f.read()

    tokens = text.split()  # split on whitespace
    # target is compiled as a regular expression; match() anchors at the
    # start of a token, so 'services' also matches 'services,'
    keyword = re.compile(target, re.IGNORECASE)

    for index, token in enumerate(tokens):
        if keyword.match(token):
            start = max(0, index - window)
            finish = min(len(tokens), index + window + 1)
            lhs = " ".join(tokens[start:index])
            rhs = " ".join(tokens[index + 1:finish])
            print("%s [%s] %s" % (lhs, token, rhs))
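
    If we want to reuse the same logic outside the command line, one option is to wrap it in a function. What follows is a minimal sketch written for Python 3, not a definitive version: the function name kwic, its return value (a list of tuples) and the sample sentence, borrowed from the report text above, are assumptions made for this example.

```python
import re


def kwic(text, target, window):
    """Return (left, keyword, right) tuples for each token matching target."""
    tokens = text.split()  # split on whitespace, as in the script
    keyword = re.compile(target, re.IGNORECASE)
    hits = []
    for index, token in enumerate(tokens):
        if keyword.match(token):  # anchored at the start of the token
            start = max(0, index - window)
            left = " ".join(tokens[start:index])
            # slicing past the end of a list is safe in Python
            right = " ".join(tokens[index + 1:index + window + 1])
            hits.append((left, token, right))
    return hits


sample = "The uptake and use of our services continues to grow, as it has done"
for left, word, right in kwic(sample, "services", 3):
    print("%s [%s] %s" % (left, word, right))
# use of our [services] continues to grow,
```

    Because match() only anchors at the start of each token, a target such as ‘service’ would also pick up ‘services’, and punctuation attached to a word (as in ‘grow,’) stays part of the token.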