Factbites
 Where results make sense

Topic: Text corpus



In the News (Sat 11 Oct 08)

  
  Text corpus - Wikipedia, the free encyclopedia
In linguistics, a corpus (plural corpora) or text corpus is a large and structured set of texts (now usually electronically stored and processed).
A corpus may contain texts in a single language (monolingual corpus) or text data in multiple languages (multilingual corpus).
Scottish Corpus of Texts and Speech: Multimedia corpus of Scots and Scottish English
en.wikipedia.org /wiki/Text_corpus   (324 words)

  
 Corpus Survey
The PNC is TEI-compliant and is annotated for part-of-speech.
Texts representing various blends of written and spoken language such as lectures, political speeches and play scripts are included in a special section in the written corpus (cf.
The BOKR corpus is encoded in TEI-compliant SGML and annotated for part-of-speech.
bowland-files.lancs.ac.uk /corplang/cbls/corpora.asp   (1870 words)

  
 Linguist List - Web Resource Listings
Penn-Helsinki Parsed Corpus of Early Modern English: The Penn-Helsinki Parsed Corpus of Early Modern English is a 1.8 million word parsed corpus of text samples of Early Modern English.
Scottish Corpus of Texts and Speech (SCOTS): SCOTS is an AHRC-funded project, creating a corpus of texts in the languages of Scotland, in the first instance Scots and Scottish English, of all available genres.
The Lancaster Corpus of Mandarin Chinese: The Lancaster Corpus of Mandarin Chinese (LCMC) is designed as a Chinese match for the FLOB and FROWN corpora for modern British and American English.
linguistlist.org /sp/Texts.html   (5088 words)

  
 A Crash Course in Corpus Linguistics
Corpus linguistics methods are ideal for research on registers and register differences, because establishing similarities and differences between registers requires a huge amount of text.
The Corpus of Middle English Prose or Verse is a part of the Middle English Compendium, also containing the Middle English Dictionary and a HyperBibliography of Middle English Prose and Verse.
The TIMIT corpus is a corpus of recorded speech, containing 6,300 sentences, recorded from male and female speakers of eight dialects of American English.
www.ling.unt.edu /corpus.html   (3390 words)

  
 Corpus linguistics - Wikipedia, the free encyclopedia
Corpus linguistics is the study of language as expressed in samples (corpora) or "real world" text.
The core of a corpus is the derivation of a set of Part-of-speech tags, representing a formal overview of the various types of words and word-relationships in a given language.
A landmark in modern corpus linguistics was the publication by Henry Kucera and Nelson Francis of Computational Analysis of Present-Day American English in 1967, a work based on the analysis of the Brown Corpus, a carefully compiled selection of current American English, totalling about a million words drawn from a wide variety of sources.
en.wikipedia.org /wiki/Corpus_linguistics   (605 words)

  
 Language Corpora   (Site not responding. Last check: 2007-10-15)
A corpus may be intended to represent (in the statistical sense) a particular linguistic variety or sublanguage, or it may be intended to represent all aspects of some assumed `core' language.
All composite texts share the characteristic that their different component texts may be of structurally similar or dissimilar types.
for a corpus text is understood to be prefixed by the
www.tei-c.org /P4X/CC.html   (7845 words)

  
 Encoding the British National Corpus
OUCS's main role in the project is to encode all corpus texts in a standard format, and to act as a central clearing-house for the exchange and storage of corpus texts for all parties involved in BNC construction work.
This illustrates one important reason for the use of TEI-conformant SGML in distributing the corpus: researchers are free to convert to and from the encoding systems and software they are happy using in their local set-up, but for the exchange of data between different computational set-ups, a single, standardized encoding scheme is to be preferred.
For example, when text is captured using optical character recognition, it is cheap and easy to capture changes in type style, but manual intervention is required to mark poetry, and to insert footnote text at its point of reference.
xml.coverpages.org /bnc-encoding2.html   (4530 words)

  
 The British National Corpus
Newspaper text is increasingly easy to acquire, since most publishers now use computerised technology to produce their papers, and because, once it is published, most newspaper text is no longer thought of as work whose copyright must be closely guarded.
The availability of newspaper text means that many researchers and teachers use it in their work; the value of the BNC's newspaper text is in its quantity, its range and its balance.
Texts, or normally only sampled sections of texts, were included without charge to the BNC project, on condition that no commercial exploitation was to be carried out from the corpus, and that the corpus would be issued to users under the terms of a standardised license agreement protecting the owners' rights.
www.natcorp.ox.ac.uk /archive/papers/gblibs.html   (5729 words)

  
 Reading Academic Text Corpus
Since the corpus was originally established in the academic year 1995-6, the number of theses has increased from 8 to 38, and the Centre is planning to expand the corpus further.
Recent research work conducted on texts in the corpus has investigated the organization of theses in different disciplines, the uses of citations (a report on this study can be viewed online), and of the means by which student writers position themselves within their texts.
This corpus would then be available to materials developers at the School as a source of authentic academic text data that can be used to build up an academic vocabulary list, and to provide examples of authentic academic language use for analysis.
www.rdg.ac.uk /app_ling/corpus.htm   (867 words)

  
 Models for Interacting Populations of Memes: Competition and Niche Behaviour
We analyze a particular corpus of posts to the soc.women newsgroup and argue that strong negative cross-correlations are examples of competition between the quasi-species.
We are describing phenomena within a corpus of texts in terms of population ecology and population genetics.
This cluster was found within a corpus of all posts sent to the soc.women newsgroup between January 8, 1997 (the far left of the graph) and January 28, 1997 (the far right).
cfpm.org /jom-emit/1997/vol1/best_ml.html   (6579 words)
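The "strong negative cross-correlations" in the excerpt above can be illustrated with a small sketch. It assumes two hypothetical daily post-count series (the numbers are invented for illustration and are not data from the paper) and computes their Pearson correlation at lag zero; the paper's own procedure may well differ.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for a constant series")
    return cov / sqrt(var_x * var_y)

# Hypothetical daily post counts for two clusters of posts (made-up numbers):
# one cluster fades while the other grows.
cluster_a = [12, 9, 7, 4, 3, 2, 1]
cluster_b = [1, 2, 4, 6, 9, 11, 12]

r = pearson(cluster_a, cluster_b)
print(f"correlation: {r:.2f}")
```

A value close to -1 means one series rises as the other falls, which is the pattern the excerpt reads as competition.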

  
 eMedicine - Corpus Callosum, Agenesis : Article Excerpt by: Manohar Aribandi, MD   (Site not responding. Last check: 2007-10-15)
Pathophysiology: Dysgenesis of corpus callosum is usually a sporadic occurrence, although the incidence is increased in patients with trisomy 18, trisomy 13, and trisomy 8.
Fibers of the corpus callosum arise from the superficial layers of the cerebral cortex and they project to the homotypic region of the contralateral cortex by passing through the corpus callosum while crossing the midline.
Secondary destruction of corpus callosum occurs when the genu and anterior body are destroyed, leaving the posterior portion of the corpus callosum intact.
www.emedicine.com /radio/byname/corpus-callosum-agenesis.htm   (524 words)

  
 Cover Pages: Electronic Text Corpus of Sumerian Literature (ETCSL)
This standardised, electronically searchable SGML corpus, which is based to a large degree on published materials, comprises some 400 literary compositions of the Isin/Larsa/Old Babylonian Period, amounting to approximately 40,000 lines of verse (excluding Emesal cult songs, literary letters, and magical incantations).
The compositions are presented in single-line composite text format (in a standardised transliteration) with newly-prepared English prose translations, and a full bibliographical database, thereby making available for the first time a collected works of Sumerian literature.
The corpus comprises: (1) an information database; (2) transliterations of 13 ancient literary catalogues; (3) composite texts of 409 literary compositions; (4) new translations of all the composite texts.
www.oasis-open.org /cover/etcsl.html   (1098 words)

  
 AQUAINT Text Corpus   (Site not responding. Last check: 2007-10-15)
The text data are separated into directories by source (apw, nyt, xie); within each source, data files are subdivided by year, and within each year, there is one file per date of collection.
The sampling for this corpus covers the period from January 1996 to September 2000, inclusive, for the Xinhua text collection, and from June 1998 to September 2000, inclusive, for New York Times and Associated Press.
However, despite our best intentions, there is unavoidable variation in the formatting of text data transmitted over these newswire services, and in a small percentage of stories, the typical cues for delimiting the TEXT content are lacking.
www.ldc.upenn.edu /Catalog/docs/LDC2002T31   (1010 words)
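The directory layout described in the excerpt above (source, then year, then one file per collection date) could be traversed roughly as follows. This is a minimal sketch: only the source names apw, nyt and xie come from the excerpt, while the function name, the year directory names and the file names in the demo are assumptions made for illustration.

```python
import tempfile
from collections import Counter
from pathlib import Path

# Source directory names as listed in the corpus description.
SOURCES = ("apw", "nyt", "xie")

def count_files_by_source_and_year(root):
    """Count data files under root/<source>/<year>/ (one file per collection date)."""
    counts = Counter()
    for source in SOURCES:
        source_dir = Path(root) / source
        if not source_dir.is_dir():
            continue  # this source is absent from the given tree
        for year_dir in sorted(source_dir.iterdir()):
            if year_dir.is_dir():
                counts[(source, year_dir.name)] = sum(
                    1 for f in year_dir.iterdir() if f.is_file()
                )
    return counts

# Demo on a throwaway tree; the file names here are hypothetical placeholders.
with tempfile.TemporaryDirectory() as tmp:
    for rel in ("nyt/1998/day1", "nyt/1998/day2", "xie/1996/day1"):
        path = Path(tmp) / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text("placeholder")
    demo_counts = count_files_by_source_and_year(tmp)

print(dict(demo_counts))
```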

  
 LANGUAGE LEARNING article--Building a Corpus of Comprehensible Text
The texts from the first weeks which had your LRP telling you things that required your physical response may provide isolated examples of linguistic phenomena, but they involve a relatively inauthentic use of language.
Thus Hale recommends that along with the text, a record be preserved which attempts “to characterize as accurately as possible what the speaker was doing, or hoping to accomplish through the discourse.” It is also important to note the audience and the quality of their interaction with the speaker.
There is much more to be said about the development of a corpus of text for linguistic and analytical purposes, but our topic here is the development of a comprehensible corpus as a language learning method.
www.languageimpact.com /articles/gt/building_corpus.htm   (7117 words)

  
 BioMed Central | about us | Data mining research
BioMed Central has so far published 21187 articles of peer-reviewed biomedical research, all of which are covered by our open access license agreement which allows free distribution and re-use of the full-text article, including the highly structured XML version.
Whether or not your research used the BioMed Central corpus, BioMed Central is keen to publish high quality research in the area of text mining and biomedical literature analysis.
Text mining of full-text journal articles combined with gene expression analysis reveals a relationship between sphingosine-1-phosphate and invasiveness of a glioblastoma cell line
www.biomedcentral.com /info/about/datamining   (626 words)

  
 Corpus Analysis for Interpreters
Corpus Linguistics is often understood as being a relatively new approach in linguistics.
Ad-hoc corpora are collections of texts related to a specific subject used to investigate a particular language and terminology.
Using specialized comparable corpora (collections of texts composed independently in the respective languages and put together on the basis of similarity of content, domain and communicative function), interpreters obtain useful terminology and content information.
sslmit.unibo.it /~cfantinuoli/corpus.html   (1644 words)

  
 Language data resources on the Internet
Electronic Text Center at UVirginia combines an on-line archive of thousands of SGML-encoded electronic texts (some of which are publicly available) with a library-based Center housing hardware and software suitable for the creation and analysis of text
Corpus Linguistics, by Michael Barlow at Rice University, includes a list of corpora by language.
Penn-Helsinki Parsed Corpus of Middle English, a database of 510,000 words of syntactically parsed Middle English text for use by historical linguists
www.sil.org /linguistics/etext.html   (440 words)

  
 Calgary Corpus   (Site not responding. Last check: 2007-10-15)
The Calgary Text Compression Corpus was founded by Ian Witten and Tim Bell, two prominent researchers from New Zealand who happened to spend some time at the University of Calgary, Canada.
Nine different types of text are represented, and to confirm that the performance of schemes is consistent for any given type, many of the types have more than one representative.
The file geo is particularly difficult to compress because it contains a wide range of data values, while the file pic is highly compressible because of large amounts of white space in the picture, represented by long runs of zeros.
links.uwaterloo.ca /calgary.corpus.html   (427 words)
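The contrast drawn in the excerpt above, between data with a wide range of values and data dominated by long runs of zeros, can be reproduced in miniature. The sketch below uses synthetic stand-ins rather than the actual geo and pic files, and zlib is simply a convenient general-purpose compressor, not necessarily one of the schemes the corpus was used to evaluate.

```python
import random
import zlib

def compression_ratio(data):
    """Compressed size divided by original size (smaller means more compressible)."""
    return len(zlib.compress(data)) / len(data)

size = 100_000

# Stand-in for an image with large white areas: one long run of zero bytes.
zero_runs = bytes(size)

# Stand-in for data with a wide range of values: seeded pseudo-random bytes.
rng = random.Random(0)
varied = bytes(rng.randrange(256) for _ in range(size))

print(f"zeros:  {compression_ratio(zero_runs):.4f}")
print(f"varied: {compression_ratio(varied):.4f}")
```

The zero-filled buffer shrinks to a tiny fraction of its size, while the varied buffer barely compresses at all.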

  
 Flags and Lollipops - Bioinformatics Blog: Distributed text corpus tagging
I've been thinking about Amazon's Mechanical Turk (a scheme which gets humans to perform short, repetitive classification tasks that are easy but boring for them, but very difficult for computers) and about user driven annotation, as in the call for a gene function wiki (via Nodalpoint).
To build up a sufficiently large corpus for biomedical related natural language parsing tasks you could develop a freely available Firefox extension - a toolbar - that appears when it thinks that you're reading an abstract.
The extension uses AJAX to call a central server in the background and to pass the current URL, the tag and the highlighted text (along with its position in the abstract, so that we can extract some context).
www.ghastlyfop.com /blog/2006/02/distributed-text-corpus-tagging.html   (826 words)
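The data passed to the central server in the excerpt above (current URL, tag, highlighted text and its position in the abstract) might be packaged along these lines. This is only a sketch of the payload and of the context extraction on the receiving side; the field names, the example URL and the sample sentence are all hypothetical, and the blog post describes a browser extension rather than Python code.

```python
import json

def build_annotation(url, tag, abstract, start, end):
    """Package one annotation: URL, tag, highlighted text, and its position."""
    if not 0 <= start < end <= len(abstract):
        raise ValueError("highlight must lie inside the abstract")
    return {
        "url": url,
        "tag": tag,
        "text": abstract[start:end],
        "start": start,
        "end": end,
    }

def context(abstract, annotation, width=20):
    """Return up to `width` characters on either side of the highlight."""
    lo = max(0, annotation["start"] - width)
    hi = min(len(abstract), annotation["end"] + width)
    return abstract[lo:hi]

# Made-up abstract sentence and placeholder URL, for illustration only.
abstract = "We report that GeneA regulates GeneB in yeast."
start = abstract.index("GeneA")
note = build_annotation(
    "https://example.org/abstract/1", "gene", abstract, start, start + len("GeneA")
)
ctx = context(abstract, note, width=10)

print(json.dumps(note))
print(ctx)
```

Storing the position alongside the text is what lets the server recover surrounding context later, even if the same string occurs more than once in the abstract.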

  
 Corpora, Text Resources
The ECI Multilingual Corpus I - Multilingual Corpus I (ECI/MCI) of over 98 million words, covering most of the major European languages, as well as Turkish, Japanese, Russian, Chinese, Malay and more.
British National Corpus - The British National Corpus (BNC) is a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of current British English, both spoken and written.
REAL Centre at Chemnitz, Germany - The German - English Translation Corpus, The Corpus of East African English, A million-word corpus of Early Modern English.
www-a2k.is.tokushima-u.ac.jp /member/kita/NLP/corpora.html   (502 words)

  
 LINGUIST List 17.2809: Text, Corpus Ling/United Kingdom; General Ling/Brazil
As a matter of policy, LINGUIST discourages the use of abbreviations or acronyms in conference announcements unless they are explained in the text.
Following the success of the workshop of the same name held in 2002, Les français des corpus 2 aims to assess the progress made and still to be made in the collection of French corpora.
Contributions are welcomed with respect to reference corpora, learner corpora and parallel (translation) corpora, as well as linguistic studies based on French corpora.
www.linguistlist.org /issues/17/17-2809.html   (492 words)

  
 SGML: JURIS Text Corpus from LDC
The text data contained on this two-CD-ROM set represent a release of the JURIS (Justice Department Retrieval and Inquiry System) data collection that has been made available to the Linguistic Data Consortium (LDC) by the U.S. Department of Justice.
There are a total of 694,667 document units in the corpus, and these can be categorized to some extent with regard to their content.
The text files are all formatted using a set of SGML tags to mark document boundaries, and to mark major structural features within documents.
xml.coverpages.org /ldc19981001.html   (734 words)

  
 Corpus Software (Text Analysis)
COSMAS - A corpus analysis toolbox, online accessible since 1995, see COSMAS.
A version is available for free for research purposes (under license).
Free Text, a Mac concordance program, should be available from the U. of Michigan site.
www.athel.com /corpus_software.html   (150 words)

  
 Corpora and Corpus Linguistics
Chapter 1, Corpus and Text: Basic Principles, John Sinclair (Tuscan Word Centre).
Chapter 4, Character Encoding in Corpus Construction, Anthony McEnery and Richard Xiao (Lancaster University).
This site was originally a Corpus Linguistics site at Rice University and consisted of a long list of links.
www.athel.com /corpus.html   (177 words)

  
 Sumerian literature: ETCSL: The Electronic Text Corpus of Sumerian Literature   (Site not responding. Last check: 2007-10-15)
The Electronic Text Corpus of Sumerian Literature is based at the University of Oxford.
The three catalogues provide access to the Sumerian literary compositions published here, while the information pages describe aspects of the website and the project, including information about recent changes to the site.
This is the first edition of the corpus.
www-etcsl.orient.ox.ac.uk /index1.htm   (185 words)

  
 [No title]
Cuneiform Circle is a community of scholars engaged in the study of Old Babylonian Akkadian.
The Old Babylonian Text Corpus (OBTC) comprises a large text database of the Old Babylonian Akkadian Language (currently 122529 text lines, letters, documents, legal texts, royal inscriptions, omina, mathematical texts etc.).
The search engine in Old Babylonian Text Corpus verse 1 is available for everybody but the search is restricted to the following texts: Codex Hammurapi and AbB 5.
www.klinopis.cz   (333 words)

  
 The Neo-Assyrian Text Corpus Project
The Neo-Assyrian Text Corpus Project, started in 1986, is a long-term undertaking to
use the CNA database to publish up-to-date critical text editions of texts written in Neo-Assyrian in a series of volumes organized by text genre (
publish a series of facsimile cuneiform texts, for both classroom and general research use, based primarily on the texts from Assurbanipal’s library (SAACT);
www.helsinki.fi /science/saa/cna.html   (464 words)


Copyright © 2005-2007 www.factbites.com Usage implies agreement with terms.