Corpus and the Principles of Good Design

Image courtesy of With Associates

It isn’t a secret that I am not enamoured with the use of corpora in the language classroom. Don’t get me wrong, I love the idea and I do use them from time to time, but my beef is with how they are designed. It’s as if the people who created them couldn’t care less about design and are only concerned with the output. Whenever I gripe about this, there are always a few people who defend them, saying that they are able to make them work. The problem for me is that I don’t want to just ‘make it work’; I want it to be almost seamless from the first use. I decided to apply Dieter Rams’ ten principles of good design to the current design of corpora, and then see what could be done to instigate change. I am not a programmer, so these ideas are just being put out there as a request to those who are able to make change.

Good design is innovative: Innovation is not just about change. It is more about approaching something from a new angle, envisioning something in the light of things changing around it. As technology advances, we can see a product in the light of new possibilities, new users.

In the case of corpora, not much has changed in the past 15-20 years other than the access (internet) and databases (larger, more nuanced). For most, the interface looks like it hasn’t left the 90s or is so overly complicated that the average user has a difficult time figuring out what to do with it all.

I would love to see some fresh eyes and minds added to the design process here. I have some ideas of where this could go, but if we put our collective minds together, I believe we could really make some serious headway in the area of innovation. Here are some areas I think we could work on:

Data collection. Instead of relying on a static database that needs periodic updates, what about making it more organic, gathering data in real time? Better yet, collection could be done through crowdsourcing.

Input. Right now, a person needs to enter a word or phrase in a text box and then sift through all of the results. We could harness the power of voice recognition, listening for prosody clues and matching that up to audio data instead of plain text. Yes, this would require a great deal more processing power, but this is something that could be overcome. Just look at Siri as an example.

Fuzzy logic. This was all the rage for a while. I even had a rice cooker with this function. I’m not sure what it meant in that context, but in most cases, it takes a wider interpretation of the input and uses logic to figure out what you may need from the clues you have given. In this case, you could enter a partial sentence and it could produce lexical results that generally match your context.
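
For readers who do program, here is a minimal sketch of what this kind of loose matching might look like. It uses simple string similarity from Python's standard library rather than true fuzzy logic, and the sample sentences and the `fuzzy_search` function are made up for illustration; no existing corpus tool is known to work this way.

```python
from difflib import SequenceMatcher

# Hypothetical mini-corpus; a real tool would search millions of sentences.
SENTENCES = [
    "She decided to take a break after lunch.",
    "They make a decision at every meeting.",
    "He took a deep breath before speaking.",
    "We need to take action on this issue.",
]

def fuzzy_search(partial, sentences, limit=2):
    """Rank sentences by rough similarity to a partial, possibly misspelled, input."""
    def score(sentence):
        # ratio() returns a value between 0 (no overlap) and 1 (identical).
        return SequenceMatcher(None, partial.lower(), sentence.lower()).ratio()
    return sorted(sentences, key=score, reverse=True)[:limit]

# The input contains a spelling slip ("brake"), yet the closest sentence still ranks first.
for match in fuzzy_search("take a brake after", SENTENCES):
    print(match)
```

The point of the sketch is only that the learner does not have to type an exact word or phrase; the tool does the work of guessing what was meant.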

Questions. What if the interface asked you questions instead of using radio buttons and vague descriptors? It could ask questions such as, “Do you want to find words that describe _[insert word you entered]_?” For English language learners, questions about purpose would be much easier to comprehend than descriptors they are expected to understand on their own.

Divide things up. Instead of having everything on one page, divide it into modules. If you want to get more, you can ask for more information and it will move from module to module. In relation to that, have one interface for simple searches and a different one for more advanced searches.

Integration with other apps. What if it could harness the power of other apps such as Twitter, Facebook, or Google Drive? You could then access the content directly from your other app instead of having to go to the corpus page to enter it.

Good design makes a product useful: When I hear the word useful, I automatically connect it to the user. In this case, corpora have been so focused on linguistics users that the broader audience of language students has been almost completely pushed aside. We need to think like a student and what they want out of it. Some of my students have found a corpus useful, but others feel it doesn’t give the information they are looking for. We should sit down with the users and figure out what they want out of it. Dieter also mentioned in this area that nothing in the design should detract from the usefulness. I think there are a number of detractors in the corpora I have used. Let’s remove them or at least keep them out of the way of the average user.

Good design is aesthetic: There is nothing wrong with making something look nice. I believe it shows that you truly care about your product and the people that are using it. It personalizes the product and makes it more comfortable for users. In this case, I would love to see corpora take on a more modern look with a conscious effort to fit in with modern usages such as mobile devices.

Good design makes a product understandable: I don’t remember where I read this, but the mechanisms used on doors are designed in such a way that we know what is required to make them work. A horizontal bar means that we are able to push it open, whereas a vertical handle is designed to be pulled. We don’t even need to think about it. As we approach the door, we know what to do and which way the door is going to open even before we reach it. The purpose of the product is self-explanatory.

This is not the case with corpora. For the most part, we need to show people how to use a corpus and demonstrate its usage. Most students have no idea what it is used for, even after being given a short introduction. It isn’t until they use it a few times that it starts to make sense. If we could design the corpus to be more intuitive and make its purpose more transparent, I think we would see a major spike in usage.

It also should borrow design elements from other products that we are familiar with. I use the example of the online classroom app, Edmodo. When students go there for the first time, they immediately see it as familiar, as it looks and works very much like Facebook. In no time at all, students are able to get down to work, focusing on the content instead of the usage. This is where we need to be with corpora.

Good design should be unobtrusive: There should be some room in the design for users to make it their own. They should be able to make it fit their usage instead of the other way around. The interface should be simple, not dominating. It is about the results, not the tool. This sounds contradictory to what I have said earlier, but it isn’t. If you are fighting to work with the interface, your energy is poured into making it work, instead of being a seamless transition from input to result.

Good design is honest: We need to be careful not to oversell the corpus and what it can do. In the end, it still requires a bit of understanding of how to get to the results you need. We need to make sure to strip down the corpus into discrete objectives, making it more honest about what it is able to accomplish.

Good design is long-lasting: The best products stand the test of time. A comparable product to the corpus is the dictionary. The dictionary hasn’t made major changes throughout its life. Any changes have built on the core product by adapting to the needs of the users and the changes in technology.

In the case of a corpus, we need to consider the architecture. Building a corpus on a structure that is heavily dependent on one technology is dangerous. An example of that is Adobe Flash. Who could have foreseen the original growth and the subsequent fall in usage? By being platform agnostic, a database can be moved from one architecture to another with relative ease. Flexibility is the key here. Even the database itself needs to allow for a natural evolution in usage and language.

Good design is thorough down to the last detail: Dieter goes on to say that nothing should be left to chance. Don’t assume users will be familiar with the interface. It should provide plenty of assistance and give samples, usage ideas, and possibly testimonials.

Good design is environmentally friendly: While a corpus is not a physical object, there are some ways that it can be eco-friendly, such as reducing bandwidth (and therefore server energy costs) by limiting graphic use and avoiding power-hungry interfaces such as Adobe Flash. Also, if we think about environment in the more general sense of where something is, a corpus should be situated within the network in such a way that it doesn’t impose on others. Tight integration with other programs helps situate it within the network as opposed to fighting against it.

Good design is as little design as possible: Once again, a corpus shouldn’t try to do too much. It should divide itself up into focused segments or modules that can be connected or pulled apart depending on the usage.

What do you think? What could be done to make a corpus more user-friendly and practical? How could a corpus be re-envisioned for the modern age? These are just some of my thoughts; it is now your turn.

Corpora and Collocations

At the last BCTEAL Conference in May, a colleague of mine gave an interesting talk on collocations and made mention of the use of some websites to help students understand what words normally go together. After the session, I was talking with another teacher about the lack of really easy-to-use corpus tools for students. It appears to me that most corpora are designed for researchers and are way too complex for the average teacher or student to use. There are a few tools that are not too bad, but for the most part, they are a mess visually and in their usage. Maybe corpus designers feel they need to add as many options as possible to satisfy the academic community who typically use them.

I did a little research after the fact and was either directed to or managed to find a few tools that may be useful for students and teachers who are interested in locating collocates of English words. In case you are not sure what any of this means, I thought a little primer on corpora might be in order. For those who understand them better than I do, my apologies for possibly oversimplifying what they are and how they work. My goal here is to provide a simple overview.

What is a corpus?

Simply put, a corpus is a text database. There is no size limit on a corpus, but the larger the corpus, the greater the chance of an accurate result. Large corpora (the plural of corpus) usually have millions of words which have been added from hundreds of thousands of documents and transcripts. For example, the British National Corpus (BNC) is made up of an incredible number of documents, resulting in a 100-million-word database.

What kind of corpora are there?

There are corpora based on spoken language taken from things such as television, interviews, radio, and other recordings. There are also academic, news, and literature databases, just to name a few. It is also possible to create your own using texts, although the sample size will be fairly small.
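
For those curious about what creating your own corpus could involve, here is a minimal sketch in Python. The two sample texts and the function names are invented for illustration; in practice you would load your own documents, and a real corpus tool does far more than count words.

```python
import re
from collections import Counter

# Hypothetical sample texts standing in for your own documents.
TEXTS = [
    "The students take notes. The teacher takes attendance.",
    "Students take a test and the teacher marks the test.",
]

def tokenize(text):
    """Lowercase a text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def build_frequency_list(texts):
    """Count how often each word appears across all of the texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts

frequencies = build_frequency_list(TEXTS)
# Show the three most frequent words with their counts.
print(frequencies.most_common(3))
```

With only two short texts, the counts say very little, which is exactly the sample-size problem mentioned above.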

How are they used?

The original corpora were used by publishers and researchers to determine common language usage in publications and language studies. Dictionaries, textbooks, and other coursebooks make heavy use of corpora to determine their content. Researchers have used corpora for cross-cultural language use studies such as comparing essays written by students in one country versus another. This helps in understanding language usage in various contexts to assist others such as teachers in the classroom.

Currently, corpus usage has been extended to the average person, such as the teacher in the classroom or even the language student directly. Tools like those listed below help students and teachers to better understand how English is put together in various genres and situations, such as word collocates (words that normally go together) and position in the sentence.
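
To make the idea of collocates concrete, here is a rough sketch of how a program might find the words that appear near a keyword. The sample text, the `collocates` function, and the two-word window are all assumptions chosen for illustration; the tools listed below use much larger databases and their own methods.

```python
import re
from collections import Counter

# Hypothetical sample text; a real corpus would hold millions of words.
TEXT = (
    "Please take a break. I will take a look at it. "
    "They take a break at noon. We take notes in class."
)

def collocates(text, keyword, window=2):
    """Count the words found within `window` words of `keyword`, one sentence at a time."""
    counts = Counter()
    for sentence in re.split(r"[.!?]+", text.lower()):
        words = re.findall(r"[a-z']+", sentence)
        for i, word in enumerate(words):
            if word == keyword:
                # Collect the neighbours on both sides of the keyword.
                neighbours = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
                counts.update(neighbours)
    return counts

# List the two most common neighbours of "take" in the sample text.
print(collocates(TEXT, "take").most_common(2))
```

Working sentence by sentence keeps words from the next sentence from being counted as neighbours of the keyword.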

Collocation Tools


COCA

COCA (Corpus of Contemporary American English): This is an excellent corpus, but not the easiest to navigate for collocations. Because it uses current American English, this database stands apart from most of the others listed here. Here is a simple way to get collocations:

  • Go to COCA and type your word in the ‘Word(s)’ box.

  • Click on the ‘Collocates’ link just below the ‘Word(s)’ box.
  • Click on the ‘Search’ button.
  • A list will appear on the right in order of collocation frequency (the number of collocates with your keyword is listed to the right under ‘Freq’). Click on any of the words and a list of sentences will appear below.


Lextutor

Lextutor Concordance: This is not one of the prettiest sites you will ever find, nor is it that easy to navigate, but it is pretty powerful. The collocation function is somewhat limited, but still useful. Here is a simple way to get a list of collocations:

  • Go to Lextutor Concordances and type your word in the box next to ‘Keywords’ and ‘equals’.

  • Click on ‘Get concordance’.

  • You will get a short list of sentences listed in alphabetical order of the words directly to the left of your keyword. You can change that at the top of the page in the ‘sort’ drop-down menus.

  • Scroll to the bottom of the page to get your short list of collocates.
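
The sorted sentence list described in the steps above is often called a concordance. As a rough illustration only, and not a description of how Lextutor itself is built, a concordance sorted by the word directly to the left of the keyword might be sketched like this. The sample sentences and the `concordance` function are made up for the example.

```python
import re

# Hypothetical sample sentences; a real concordancer searches a full corpus.
SENTENCES = [
    "We took a short break after class.",
    "Try not to break the rules.",
    "She needs a long break from work.",
    "Break time starts at noon.",
]

def concordance(sentences, keyword, width=3):
    """Return (left words, keyword, right words) rows sorted by the word just left of the keyword."""
    rows = []
    for sentence in sentences:
        words = re.findall(r"[A-Za-z']+", sentence)
        for i, word in enumerate(words):
            if word.lower() == keyword:
                left = words[max(0, i - width):i]
                right = words[i + 1:i + 1 + width]
                rows.append((left, word, right))
    # Sort on the word immediately to the left (an empty string when there is none).
    rows.sort(key=lambda row: row[0][-1].lower() if row[0] else "")
    return rows

# Print each hit with the keyword lined up in a column.
for left, word, right in concordance(SENTENCES, "break"):
    print(f"{' '.join(left):>25}  [{word}]  {' '.join(right)}")
```

Lining the keyword up in a column makes the patterns on either side easier to spot, which is the main appeal of this kind of display.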


JTW

Just the Word (JTW): This is a popular tool with language teachers and students, and for good reason. Of the most-used collocation tools, this is one of the easiest to navigate, although it is a bit limiting. It is based on the BNC, so the results are decidedly British (i.e. the collocations may be different from those in North American English). Here is how it works:

  • Go to JTW and type your word in the ‘Enter a word or short phrase’ box and click on ‘Combinations’.

  • You will get a list of collocations divided by ‘clusters’. These clusters are related to the meaning of the word and the word type. You will also see a green line showing how often these word combinations are found together.

  • Click on any of the word combinations and you will get a list of the sentences with that combination.


Collection

Corpora Collection: This is a collection of some of the open corpora including the BNC, Brown, and Reuters. You can change which corpus you use and can get a list of words that collocate with your keyword in that database. Here is a simple use of this site:

  • Go to the Corpora Collection site and type your keyword into the box at the top of the page.

  • Click on the button next to ‘Collocations’ about halfway down the page.

  • Click on ‘Submit’ at the top of the page.

  • You will get a list of collocations in order by score from most to least.


Word

Word and Phrase: This site has a number of tools, but I just wanted to focus on collocation tools for students and teachers. This site is another of those that has lots of functions, but the tools are complex or not necessary for students. Here is how you can create a simple collocations list:

  • Go to Word and Phrase and click on ‘Frequency list’.

  • Type your word in the ‘Word’ box and click on ‘Search’.

  • You will get a list on the right-hand side listed by parts of speech (PoS). Click on the PoS that you would like to see and a list of sentences will be displayed below.

  • The collocations are listed alphabetically by those to the right of the word.


Skell

SkELL: This site is based on the Sketch Engine which is used by a number of other sites. It uses a cross-section of texts. It is also very simple to use and offers something a little different. Here is how it works:

  • Go to SkELL and type your word in the box at the top of the page.

  • Click on ‘Word Sketch’ and a list of words under word type categories appears below. Click on one of the words listed below to get a list of sentences using that word combination.


Flax

Flax Learning Collocation: This is easily one of the simplest and also nicest of all of the collocation sites. Thanks to Mura Nava who kindly pointed me in the direction of this site during one of my corpus rants on Twitter, I now have a site I can comfortably send my students to knowing they won’t need a lot of hand holding through the process. Here is how it works:

  • Go to Flax Learning Collocations, type your word into the box at the top of the page, and click on ‘go’ (you can also choose a different corpus from the drop-down menu to the left of ‘go’ before clicking on it).

  • You will find a nice list of collocations broken down by usage, with a number beside each collocation showing how often it is found in the database.

  • Click on any of the collocations and you will get a new list showing the variations of that collocation. Click on any of those and you will get a list of sample sentences using that combination.

Let me know what you think. Do you have any to add? How do you use corpora in your classroom? Share your ideas, thoughts, and comments below. Thank you!

Using archived TV news broadcasts in the English language classroom

I was visiting the Internet Archive the other day for my class and came across the TV News archive. This is a searchable database of over 437,000 TV news broadcast video transcripts and the corresponding videos with them. I think this has a number of uses in English language learning. Here is how it works:

Steps:

  1. Go to the TV News Archive of the Internet Archive website.
  2. Type in a word or phrase search in the search box labelled ‘Search captions through 24 hours ago’ and click on ‘Search’.
  3. A series of videos with their transcripts will appear along the bottom of the screen. The first video will start immediately, so you may need to click on the pause button. All the videos will start a little before the word or phrase that you have searched for and will continue to play for a total of approximately 30 seconds.
  4. After the video has finished playing, it will automatically move on to the next video, and this continues until you pause it. If you would like to see a larger version of the video, click on ‘More/Borrow’ at the top of the small video. This will give you a larger version of the video along with a ‘Share’ button to get the video’s URL. Close the large video mode by clicking on the red X in the top-right corner of the video. Use the back arrow of your browser to go back to the search results.
  5. In the search results area, you can scroll right and left to see more videos and transcripts.
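
For the technically inclined, the way each clip starts a little before the search term and runs for about 30 seconds can be sketched as follows. This is only a guess at the general idea, not the Internet Archive's actual implementation; the captions, the `find_clips` function, and the five-second lead-in are all invented for illustration.

```python
# Hypothetical timestamped captions: (start time in seconds, caption text).
CAPTIONS = [
    (0, "Good evening and welcome to the news."),
    (12, "Officials will take questions later today."),
    (25, "In sports, the home team won again."),
    (40, "Voters take part in the election tomorrow."),
]

def find_clips(captions, phrase, lead_in=5, length=30):
    """Return (start, end) times for clips that begin shortly before each caption containing `phrase`."""
    clips = []
    for start, text in captions:
        # Simple substring check, so "take" would also match "mistaken".
        if phrase.lower() in text.lower():
            clip_start = max(0, start - lead_in)
            clips.append((clip_start, clip_start + length))
    return clips

# Each pair is the start and end time, in seconds, of one clip.
print(find_clips(CAPTIONS, "take"))
```

Starting the clip a few seconds early gives the listener some context before the search term is spoken.
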

Here are some ideas on how the TV News Archive could be used with English language learners:

  1. Students could search for word collocations such as searching for ‘take’ and seeing and hearing the words that normally go together with it. This would take the role of a corpus.
  2. Have students search for a video on a particular word to see if they can figure out what it means in context.
  3. Since the video is fairly short and starts in the middle of a conversation, see if they can guess what happened before this section of the video or what is coming up next. This could be followed up with a research project on the topic mentioned.
  4. Students can listen to the pronunciation of a word along with the possible intonation or rhythmic usage in different contexts.

These are just a few ideas that come to mind. If you have any ideas that you would like to contribute, please share them in the comment section below or send me a tweet at @nathanghall. Thank you!