

2014-05-06


Introduction to linguistics

Lecture 11: Computational linguistics

Sources

• Fromkin, Victoria, Robert Rodman, and Nina Hyams. 2003. An introduction to language.
– Chapter 9: Humans and Computers.
• http://view.byu.edu/ (a collection of various corpora)


Language and computers

• Computational linguistics (CL) – a subfield of linguistics and computer science.
• It deals with the interaction between human language and computers.
• It overlaps with the field of artificial intelligence (AI), a branch of computer science aiming at computational models of human cognition.


The scope of CL

• CL covers the use of computers for:
– the analysis of written text and spoken discourse;
– the translation of text and speech from one language to another;
– the use of human languages for communication between computers and people;
– the modelling and testing of linguistic theories.


Text analysis

• Computers are very helpful for handling texts. They make it possible to:
– manipulate data easily and rapidly (searching, sorting, etc.);
– process data accurately and consistently;
– annotate data automatically, i.e. add notes to a text.
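The operations listed above (searching, sorting, annotating) can be illustrated with a minimal Python sketch. The sample sentence, the function names and the hand-made tag lexicon are assumptions invented for illustration; real corpus tools rely on far larger resources.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def frequency_list(tokens):
    """Return (word, count) pairs, most frequent first, ties in alphabetical order."""
    counts = Counter(tokens)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def annotate(tokens, lexicon):
    """Attach a tag from a hand-made lexicon to each token; unknown words get 'UNK'."""
    return [(tok, lexicon.get(tok, "UNK")) for tok in tokens]


# Hypothetical sample text and tag lexicon.
sample = "The cat saw the dog. The dog saw the cat."
toy_lexicon = {"the": "DET", "cat": "NOUN", "dog": "NOUN", "saw": "VERB"}

tokens = tokenize(sample)
freq = frequency_list(tokens)          # searching and sorting
tagged = annotate(tokens, toy_lexicon)  # automatic annotation

print(freq)
print(tagged[:3])
```

The same three steps (tokenise, count and sort, annotate) give the same result every time they are run on the same text, which is the consistency the slide refers to.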


Corpora

• CORPUS (from Latin corpus, 'body'; pl. corpora) traditionally indicates a collection of texts, esp. complete and self-contained, e.g. The Corpus of Anglo-Saxon Verse.
• In linguistics and lexicography – a body of texts, utterances, or other specimens considered more or less representative of a language and stored as an electronic database.


Corpora

• Some purposes that corpora serve:

– collection of examples for linguists;
– data resource for lexicographers;
– instruction material for language teachers and learners;
– training material for natural language processing (NLP).

• Applications of corpora:

– training of speech recognizers;
– training of statistical part-of-speech taggers and parsers;
– training of example-based and statistical machine translation systems.


Corpus linguistics

• Corpus linguistics – the study of language as expressed in samples (corpora) of "real world" text.
• For linguistic purposes, corpora can help investigate such questions as:
– What is the order of different types of adjectives in English?
– With what frequency do older speakers in the Midwest use cool?
– Which do you say in English: think about or think on?

• According to Google (06.05.2014):
– think about: 4,200,000,000 results;
– think on: 3,880,000,000 results.
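A corpus query of the think about / think on kind boils down to counting matches. Below is a minimal Python sketch, assuming a four-sentence toy corpus invented for illustration (it is not drawn from any real corpus); the function name count_phrase is likewise hypothetical.

```python
import re


def count_phrase(text, phrase):
    """Count case-insensitive, whole-word occurrences of a phrase in a text."""
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


# Invented toy corpus; a real study would query a corpus such as the BNC.
toy_corpus = (
    "I think about language every day. "
    "Think about it before you answer. "
    "She likes to think on her feet. "
    "We think about corpora a lot."
)

counts = {p: count_phrase(toy_corpus, p) for p in ("think about", "think on")}
ratio = counts["think about"] / counts["think on"]

print(counts)
print(ratio)
```

Applied to the Google figures quoted above, the same division gives a ratio of roughly 1.08 in favour of think about.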


An example of a corpus

• The British National Corpus (BNC), on-line since 1995, is a collection of samples of written and spoken language from various sources, about 100 million words in total.
– It is designed to represent a wide cross-section of current British English.
– Single words or phrases can be looked up: http://www.natcorp.ox.ac.uk/


Plain corpora

• Some corpora are plain, i.e. they carry no information about the text:
– e.g. Project Gutenberg texts were produced by scanning;
– the texts were then converted into a collection of public-domain e-books;
– as of March 2014, Project Gutenberg claimed over 45,000 items in its collection.
• http://www.gutenberg.org/ (free e-books)
• https://archive.org/details/gutenberg (archive)


Corpora and Machine Translation

• Corpora provide actual language tokens, and translation equivalents can be extracted from them for Machine Translation programs.
• Machine Translation (MT) – a subfield of computational linguistics that investigates the use of software to translate text or speech from one natural language to another.


Machine translation

• Translation is hard for a computer.
• The computer has to:
– “understand” the source text;
– convert it into the target language;
– generate correct target text.
• The procedure looks simple but is, in fact, a complex cognitive operation:
– many translation problems require real-world knowledge and intuitions about the meaning of the text.


Understanding the source text

• Lexical ambiguity
– At the morphological level:
  • ambiguity of word vs stem + ending (tower, flower);
  • inflections are ambiguous (books, loaded);
  • a derived form may be lexicalised (meeting, revolver).
– Lexicalisation = adding words, set phrases, or word patterns to a language.
– Grammatical category ambiguity (e.g. round).
– Homonymy: alternative meanings of the same expression.
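The word vs stem + ending ambiguity can be made concrete with a minimal Python sketch. The tiny lexicon, the list of endings and the function name analyses are assumptions made up for illustration; a real morphological analyser is far more elaborate.

```python
# Toy lexicon (an assumption for illustration, not a real dictionary).
WHOLE_WORDS = {"tower", "flower", "meeting"}
STEMS = {"tow", "flow", "book", "load", "meet"}
ENDINGS = ("er", "s", "ed", "ing")


def analyses(word):
    """Return every way the toy lexicon can analyse a word form."""
    found = []
    # Reading 1: the form is an unanalysed (lexicalised) word.
    if word in WHOLE_WORDS:
        found.append((word,))
    # Reading 2: the form is a known stem followed by a known ending.
    for ending in ENDINGS:
        stem = word[: -len(ending)]
        if word.endswith(ending) and stem in STEMS:
            found.append((stem, ending))
    return found


for w in ("tower", "flower", "meeting", "books"):
    print(w, analyses(w))
```

For tower the sketch returns two competing analyses, the whole word and tow + er, and nothing in the form itself tells the program which one is meant. The sketch does not model the further ambiguity of the ending in books (plural noun vs third-person verb).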


Understanding the source text

• Syntactic ambiguity
– Due to a combination of grammatically ambiguous words, e.g.: Time flies like an arrow; fruit flies like a banana.
– Due to alternative interpretations of structure, e.g.: The man saw the girl with a telescope.
• In addition, there are problems resulting from differences between languages.
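The structural ambiguity of the telescope sentence can be checked mechanically. The sketch below counts parse trees with a CKY-style chart over a toy grammar; the grammar rules, the lexicon and the function name count_parses are assumptions chosen for illustration, not a description of any particular parser.

```python
from collections import defaultdict

# Toy grammar with binary rules only (parent, left child, right child).
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("NP", "Det", "N"),
    ("NP", "NP", "PP"),   # the girl [with a telescope]
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),   # [saw the girl] [with a telescope]
    ("PP", "P", "NP"),
]
# Toy lexicon: one category per word.
LEXICON = {
    "the": "Det", "a": "Det",
    "man": "N", "girl": "N", "telescope": "N",
    "saw": "V", "with": "P",
}


def count_parses(words, start="S"):
    """Count the distinct parse trees for a word list (CKY-style chart)."""
    n = len(words)
    chart = defaultdict(int)  # (start index, end index, category) -> tree count
    for i, word in enumerate(words):
        chart[(i, i + 1, LEXICON[word])] += 1
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):  # split point between the two children
                for parent, left, right in BINARY_RULES:
                    chart[(i, j, parent)] += chart[(i, k, left)] * chart[(k, j, right)]
    return chart[(0, n, start)]


sentence = "the man saw the girl with a telescope".split()
print(count_parses(sentence))
```

Under this toy grammar the sentence receives two trees: one in which the prepositional phrase attaches to the verb phrase (the man uses the telescope) and one in which it attaches to the noun phrase (the girl has the telescope).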


Machine translation

• It is difficult to get a literary-quality translation.
• Today’s MT systems can generate rough translations that give you at least the gist of a document.
• High-quality translations are possible in narrow, specialised domains, e.g. weather forecasts.


Statistical MT

• It is virtually impossible to write an algorithm that would fully capture the grammar of a natural language.
• Rather than relying on explicitly written translation rules, computer algorithms are trained on human-translated parallel texts;
– this allows them to learn automatically how to translate (thanks to neural networks, statistical methods, etc.).


Statistical translation programs

• Translations are generated on the basis of statistical models analysing bilingual text corpora.
• E.g. Google Translate works by detecting patterns in hundreds of millions of documents that have previously been translated by humans;
• it then makes intelligent guesses based on the patterns it has learned.
• The more human-translated documents there are in a given language, the more likely it is that the translation will be of good quality.
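The idea of learning translation equivalents from human-translated text can be sketched in a few lines of Python. This is a deliberately simple co-occurrence heuristic (the Dice coefficient), not the method used by Google Translate or any other real system; the three-sentence English-Polish parallel corpus and the function names are invented for illustration.

```python
from collections import Counter
from itertools import product

# Invented toy parallel corpus: (English, Polish) sentence pairs.
PARALLEL = [
    ("small house", "mały dom"),
    ("small cat", "mały kot"),
    ("black cat", "czarny kot"),
]


def train(pairs):
    """Map each source word to the target word it co-occurs with most strongly."""
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        s_words, t_words = set(src.split()), set(tgt.split())
        src_count.update(s_words)
        tgt_count.update(t_words)
        joint.update(product(s_words, t_words))  # every cross-language word pair

    def dice(s, t):
        # Dice coefficient: 2 * joint count / (source count + target count).
        return 2 * joint[(s, t)] / (src_count[s] + tgt_count[t])

    return {s: max(tgt_count, key=lambda t: dice(s, t)) for s in src_count}


def translate(sentence, table):
    """Word-by-word translation; unknown words are passed through unchanged."""
    return " ".join(table.get(word, word) for word in sentence.split())


table = train(PARALLEL)
print(translate("black house", table))
```

The phrase black house never appears in the training pairs, yet the sketch translates it by combining equivalents learned from other sentences. With more parallel data the counts become more reliable, which mirrors the point about the amount of human-translated material.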


English-Polish MT programs

• Let’s compare two programs translating from English to Polish (and vice versa):
– Poltran (http://www.poltran.com/), Ectaco Inc., and
– Translatica (http://www.translatica.pl/), PWN.
• The sentence to translate is as follows:
Nie można zapominać także o korzyściach społecznych i niebezpieczeństwach wynikających z wadliwego zaprojektowania budynku.
(Roughly: "One must also not forget about the social benefits and the dangers resulting from a faulty design of the building.")


English-Polish MT programs

Poltran:
It is not possible to forget about social benefits also and from defect designing building dangers subsequent.

Translatica:
It isn't possible to forget .. also about social benefits and dangers resulting from effective designing the building.
