Melissa M. Terras
- Published in print: 2006
- Published Online: September 2007
- ISBN: 9780199204557
- eISBN: 9780191708121
- Item type: chapter
- Publisher: Oxford University Press
- DOI: 10.1093/acprof:oso/9780199204557.003.0003
- Subject: Classical Studies, British and Irish History: BCE to 500CE
This chapter focuses on the methodologies used to collate the data sets needed to train a system to read unknown characters contained within the stylus tablets. The Vindolanda ink texts are the most significant body of documents contemporaneous with the stylus tablets, and it is demonstrated that the stylus tablets should contain similar Old Roman Cursive (ORC) character forms. The construction of a corpus of annotated images, based on the ink tablet letter forms, and the gathering of linguistic information from the Vindolanda ink texts that have already been read provided the information needed to train the system. Techniques included knowledge elicitation exercises with experts to gain an understanding of the types of information used to describe and identify ORC character forms; the construction of an encoding scheme; and the adoption and adaptation of an image markup tool to enable the annotation of images of the Vindolanda texts, producing textual descriptions of the letter forms in Extensible Markup Language (XML). This resulted in a corpus of 1,700 individual ORC characters. Additional linguistic analysis of the Vindolanda ink texts provided lexicostatistics (word lists, word frequency, and letter frequency) regarding the language used at Vindolanda.
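To make the annotation workflow more tangible, the sketch below shows what an XML description of a single ORC letter form could look like. The element and attribute names, the tablet reference, and the region coordinates are invented for illustration; they are not the encoding scheme developed in the chapter.

```xml
<!-- Hypothetical annotation of a single Old Roman Cursive (ORC) letter form.
     Element names, attributes, and coordinates are illustrative only. -->
<character id="letter-a-001" letter="a" script="ORC">
  <source tablet="Vindolanda ink tablet (illustrative)" medium="ink"/>
  <!-- pixel region of the letter on the tablet image -->
  <region x="412" y="138" width="36" height="52"/>
  <description>
    <strokes count="2">
      <stroke order="1" direction="down-right"/>
      <stroke order="2" direction="down-left"/>
    </strokes>
    <ligature with="none"/>
  </description>
</character>
```

Descriptions of this kind, attached to image regions, are the sort of record that would accumulate into the corpus of 1,700 annotated characters described above.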
Susan Hockey
- Published in print: 2000
- Published Online: October 2011
- ISBN: 9780198711940
- eISBN: 9780191694912
- Item type: chapter
- Publisher: Oxford University Press
- DOI: 10.1093/acprof:oso/9780198711940.003.0003
- Subject: Literature, Criticism/Theory
This chapter looks at the encoding schemes that are necessary to put intelligence into a text. The purpose of encoding within a text is to provide information that will assist a computer program in performing functions on that text. The information embedded in the text is variously called encoding, markup, or tagging, although the term ‘tagging’ is also used somewhat more narrowly in corpus linguistics to denote encoding for grammatical categories and possibly other linguistic features. Exactly what markup scheme to use depends very much on the nature of the project. For a digital library or a large, long-term project, it makes sense to use SGML, or rather XML, for which there is likely to be more support. The TEI is a very good starting point for humanities material and can be used as a basis for the development of a specialized encoding scheme.
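As a minimal sketch of the kind of embedded markup the chapter describes, the TEI-style fragment below wraps a short, invented passage with a header and two descriptive tags. It is a generic example, not an encoding prescribed by the chapter.

```xml
<!-- Minimal TEI-style document: the markup adds information a program can act on. -->
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample letter</title></titleStmt>
      <publicationStmt><p>Unpublished sample.</p></publicationStmt>
      <sourceDesc><p>Born-digital example.</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>Greetings from <placeName>Vindolanda</placeName>,
         written by <persName>Claudia Severa</persName>.</p>
    </body>
  </text>
</TEI>
```

A program can then operate on the tags rather than on raw text, for example extracting every placeName or persName, which is the sense in which markup puts intelligence into the text.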
Christopher Walton
- Published in print: 2006
- Published Online: November 2020
- ISBN: 9780199292486
- eISBN: 9780191917691
- Item type: chapter
- Publisher: Oxford University Press
- DOI: 10.1093/oso/9780199292486.003.0008
- Subject: Computer Science, Computer Architecture and Logic Design
In the introductory chapter of this book, we discussed the means by which knowledge can be made available on the Web, that is, the representation of knowledge in a form in which it can be automatically processed by a computer. To recap, we identified two essential steps that were deemed necessary to achieve this task:
1. We discussed the need to agree on a suitable structure for the knowledge that we wish to represent. This is achieved through the construction of a semantic network, which defines the main concepts of the knowledge and the relationships between these concepts. We presented an example network that contained the main concepts needed to differentiate between kinds of cameras. Our network is a conceptualization, or an abstract view of a small part of the world. A conceptualization is defined formally in an ontology, which is in essence a vocabulary for knowledge representation.
2. We discussed the construction of a knowledge base, which is a store of knowledge about a domain in machine-processable form; essentially a database of knowledge. A knowledge base is constructed through the classification of a body of information according to an ontology. The result is a store of facts and rules that describe the domain. Our example described the classification of different camera features to form a knowledge base. The knowledge base is expressed formally in the language of the ontology over which it is defined.
In this chapter we elaborate on these two steps to show how we can define ontologies and knowledge bases specifically for the Web. This will enable us to construct Semantic Web applications that make use of this knowledge. The chapter is devoted to a detailed explanation of the syntax and pragmatics of the RDF, RDFS, and OWL Semantic Web standards. The Resource Description Framework (RDF) is an established standard for knowledge representation on the Web. Taken together with the associated RDF Schema (RDFS) standard, we have a language for representing simple ontologies and knowledge bases on the Web.
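As a hedged illustration of the two steps, the RDF/XML fragment below defines a tiny camera vocabulary with RDFS (step 1) and classifies one individual against it (step 2). The ex: names and the resolution property are invented for this sketch and are not taken from the book's example.

```xml
<?xml version="1.0"?>
<!-- Tiny camera ontology (RDFS classes and a property) plus one knowledge-base fact.
     The ex: vocabulary is invented for illustration. -->
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:ex="http://example.org/camera#">

  <!-- Conceptualization: Camera and its subclass DigitalCamera -->
  <rdfs:Class rdf:about="http://example.org/camera#Camera"/>
  <rdfs:Class rdf:about="http://example.org/camera#DigitalCamera">
    <rdfs:subClassOf rdf:resource="http://example.org/camera#Camera"/>
  </rdfs:Class>

  <!-- A property relating a camera to its resolution -->
  <rdf:Property rdf:about="http://example.org/camera#resolution">
    <rdfs:domain rdf:resource="http://example.org/camera#Camera"/>
  </rdf:Property>

  <!-- Knowledge base: one individual classified as a DigitalCamera -->
  <ex:DigitalCamera rdf:about="http://example.org/camera#myCamera">
    <ex:resolution>12 megapixels</ex:resolution>
  </ex:DigitalCamera>
</rdf:RDF>
```

OWL builds on this RDFS core with richer constructs such as class restrictions and cardinality constraints, which is where the chapter's discussion of the three standards leads.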
Adam Treister
- Published in print: 2005
- Published Online: November 2020
- ISBN: 9780195183146
- eISBN: 9780197561898
- Item type: chapter
- Publisher: Oxford University Press
- DOI: 10.1093/oso/9780195183146.003.0012
- Subject: Chemistry, Physical Chemistry
Flow cytometry is a result of the computer revolution. Biologists used fluorescent dyes in microscopy and medicine almost a hundred years before the first flow cytometer. Only after electronics became sophisticated enough to control individual cells, and computers became fast enough to analyze the data coming out of the instrument and to make a decision in time to deflect the stream, did cell sorting become viable. Since the 1970s, the capabilities of computers have grown exponentially. According to the famed Moore’s Law, the power of a computer, as tracked by the number of transistors on a chip, doubles every 18 months. This rule has held for three decades so far, and new technologies continue to appear to keep that growth on track. The clock speed of chips is now measured in gigahertz (billions of cycles per second), and hard drives are now available with capacities measured in terabytes. Having computers so powerful, cheap, and ubiquitous changes the nature of scientific exploration. We are in the early steps of a long march of biotechnology breakthroughs spawned from this excess of compute power. From genomics to proteomics to high-throughput flow cytometry, the trend in biological research is toward mass-produced, high-volume experiments. Automation is the key to scaling their size and scope and to lowering their cost per test. Each step that was previously done by human hands is being delegated to a computer or a robot so that the implementation is more precise and scales efficiently. From making sort decisions in milliseconds to creating data archives that may last for centuries, computers control the information involved with cytometry, and software controls the computers. As the technology matures and the size and number of experiments increase, the emphasis of software development switches from instrument control to analysis and management. The challenge for computers is no longer running the cytometer. The more modern challenge for informatics is to analyze, aggregate, maintain, access, and exchange the huge volume of flow cytometry data. Clinical and other regulated use of cytometry necessitates more rigorous data administration techniques. These techniques introduce issues of security, integrity, and privacy into the processing of data.
Gard B. Jenset and Barbara McGillivray
- Published in print: 2017
- Published Online: October 2017
- ISBN: 9780198718178
- eISBN: 9780191787515
- Item type: chapter
- Publisher: Oxford University Press
- DOI: 10.1093/oso/9780198718178.003.0004
- Subject: Linguistics, Historical Linguistics
Chapter 4 explains the concept and process of annotation for historical corpora, from a theoretical, practical, and technical point of view, and discusses the challenges presented by historical texts. We introduce basic terminology for XML technologies and corpus metadata, and we describe the different levels of linguistic annotation, from spelling normalization to morphological, syntactic, and semantic analysis, and briefly present the state of the art for historical corpora and treebanks. We cover annotation schemes and standards and illustrate the main concepts in corpus annotation with an example from LatinISE, a large annotated Latin corpus.
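As an illustrative sketch of token-level annotation in XML, the fragment below encodes a short Latin clause with lemma and part-of-speech information. The element and attribute names are generic conventions, not necessarily the LatinISE scheme described in the chapter.

```xml
<!-- Generic token-level annotation of a short Latin clause.
     Attribute names and values are illustrative, not the actual LatinISE format. -->
<sentence id="s1">
  <token id="t1" form="Gallia" lemma="Gallia" pos="NOUN" case="nominative"/>
  <token id="t2" form="est"    lemma="sum"    pos="VERB" mood="indicative"/>
  <token id="t3" form="omnis"  lemma="omnis"  pos="ADJ"  case="nominative"/>
  <token id="t4" form="divisa" lemma="divido" pos="VERB" mood="participle"/>
</sentence>
```

Spelling normalization, syntactic (treebank) structure, and semantic labels would add further layers on top of this kind of token record.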
Cate Dowd
- Published in print: 2020
- Published Online: April 2020
- ISBN: 9780190655860
- eISBN: 9780190098445
- Item type: chapter
- Publisher: Oxford University Press
- DOI: 10.1093/oso/9780190655860.003.0001
- Subject: Political Science, International Relations and Politics
Online news systems share some affordances of Turing’s universal machine, especially configurability, but the early generation of web standards enabled data sharing, interoperability, and ultimately frameworks for reasoning about digital resources. At the backend of online news, indexing, mark-up languages, and applied logic provide a base for machine intelligence that ultimately extends to cloud servers and big data. XML languages, such as RSS, enabled the first phase of sharing stories in the form of newsfeeds. Specific mark-up for online news, such as NewsML, also defined layout and other features of news sites. Tim Berners-Lee established the W3C for online standards in the 1990s, and then on the cusp of the 21st century he proposed semantic and structured approaches for meaningful data sharing online. However, in subsequent years entrepreneurs have appropriated semantic approaches for different ends. The atomisation of data also introduces “personalised” data preferences to pitch news stories.
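For concreteness, a minimal RSS 2.0 newsfeed carrying a single story might look like the sketch below; the feed title, URLs, headline, and date are placeholders.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Minimal RSS 2.0 feed with one item; titles and URLs are placeholders. -->
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <link>https://news.example.org/</link>
    <description>Latest stories from a hypothetical newsroom</description>
    <item>
      <title>Placeholder headline</title>
      <link>https://news.example.org/stories/placeholder</link>
      <pubDate>Mon, 06 Apr 2020 09:00:00 GMT</pubDate>
      <description>One-sentence summary of the story.</description>
    </item>
  </channel>
</rss>
```

Aggregators poll feeds of this kind and merge the items into a reader's stream, which is the first phase of story sharing that the chapter describes.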