Advanced Knowledge Technologies is recognised as a leading research programme conducted at some of the foremost informatics departments in Britain. It is also a training ground for a new generation of researchers. To highlight the work of these students, a
differing reliability; information found can even be spam, typos, deliberate
misinformation or simply erroneous. When information is contained in textual
documents, extracting it requires more sophisticated IE methodologies based on
linguistic analysis and methods to ensure a degree of reliability of extracted
information.
Information from whatever source is often redundant, i.e. it can be
found in different contexts and in different superficial formats; the redundancy
of information can in itself be weak proof of its validity [5]. However, we
suggest that this alone is not enough: additional information either within or
surrounding extracted entities can be used to develop additional evidence.
Armadillo now uses an evidence-building approach that combines numerous
rudimentary techniques (based on SimMetrics, Section 3).
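
To make the idea of accumulating weak evidence concrete, the following is a minimal sketch and not Armadillo's actual implementation. It assumes that each observation of a string comes with a reliability score for its source, that Python's `difflib.SequenceMatcher` can stand in for a SimMetrics-style similarity metric, and that evidence is combined with a noisy-OR rule. The function names, the threshold and the sample data are all hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1]; a stand-in for a SimMetrics-style metric."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def combined_evidence(candidate, observations, threshold=0.85):
    """Noisy-OR combination of source reliabilities (an assumed rule).

    observations: iterable of (surface_string, source_reliability) pairs.
    Every observation whose string is similar enough to the candidate is
    treated as one weak, independent piece of supporting evidence.
    """
    disbelief = 1.0
    for surface, reliability in observations:
        if similarity(candidate, surface) >= threshold:
            disbelief *= 1.0 - reliability
    return 1.0 - disbelief


# Made-up observations: the same name found twice in different surface forms,
# plus one unrelated name that should not contribute.
observations = [
    ("Jane Smith", 0.6),
    ("jane smith", 0.5),
    ("John Brown", 0.9),
]
score = combined_evidence("Jane Smith", observations)
```

Under these assumptions, two moderately reliable sightings of the same name yield a higher score than either alone, which is the sense in which redundancy acts as weak proof.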
4.2 How Armadillo Works
Armadillo learns how best to extract information in the following way:
1. it mines a coherent portion of the repository (e.g. a web site or a class
   of sites);
2. it integrates information and assigns reliabilities to different sources
   (e.g. digital libraries, services, web pages). These ratings are used to
   direct the learning from the repository;
3. it discovers new information in the repository, which in turn is rated and
   used to bootstrap new learning until a stable information base is reached;
4. it stores the harvested information in an RDF knowledge base. The database
   can then be used to access the extracted information (as detailed later) or
   to produce indices for document retrieval or annotation.
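
The steps above can be sketched as a simple fixed-point loop. This is an illustrative outline only: the `bootstrap` function, its parameters, the reliability threshold and the toy `extract` callback are assumptions made for the example, and storing the result in RDF is left out.

```python
def bootstrap(seeds, sources, extract, min_reliability=0.5, max_rounds=10):
    """Grow a set of facts from seeds until no new facts are found.

    seeds:   initial set of trusted facts
    sources: dict mapping source name -> reliability in [0, 1]
    extract: callable(source_name, known_facts) -> set of candidate facts
    """
    known = set(seeds)
    for _ in range(max_rounds):
        new_facts = set()
        for name, reliability in sources.items():
            # Reliability ratings direct the learning: skip weak sources.
            if reliability < min_reliability:
                continue
            new_facts |= extract(name, known) - known
        if not new_facts:
            # Nothing new was discovered: a stable information base.
            break
        known |= new_facts
    return known


# Toy data (hypothetical): each known fact leads to further facts.
LINKS = {"a": {"b"}, "b": {"c"}, "c": set()}


def toy_extract(source_name, known_facts):
    found = set()
    for fact in known_facts:
        found |= LINKS.get(fact, set())
    return found


result = bootstrap({"a"}, {"site": 0.9, "spam": 0.1}, toy_extract)
```

Each round uses everything learned so far to find more, and the loop stops as soon as a round adds nothing, which corresponds to the stable information base mentioned in step 3.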
Armadillo is a data-driven system, typically beginning from rigidly structured,
reliable sources using examples provided by a wrapper, the user or previous
data. Armadillo then uses the seeds obtained in this way to learn on more
complex sources (e.g. free texts).
In order to explain in more depth the working of Armadillo, an example is
now detailed.
4.3 The Computer Science Department Application
Consider the following example task of mining web sites of Computer Science
Departments to find academics (name, position, home page, email address,
telephone number and a list of publications more complete than the one provided
by repositories such as Citeseer).
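
As a rough illustration of the target of this task, the record below collects the fields listed above and shows one way partial records from different sources might be merged. The class name, field names and merge policy are assumptions for the example, not Armadillo's actual schema, and the sample values are invented.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Academic:
    name: str
    position: Optional[str] = None
    home_page: Optional[str] = None
    email: Optional[str] = None
    telephone: Optional[str] = None
    publications: set = field(default_factory=set)

    def merge(self, other: "Academic") -> "Academic":
        """Combine two partial records assumed to describe the same person.

        Scalar fields keep this record's value and fall back to the other
        record's value; publication sets are unioned.
        """
        return Academic(
            name=self.name,
            position=self.position or other.position,
            home_page=self.home_page or other.home_page,
            email=self.email or other.email,
            telephone=self.telephone or other.telephone,
            publications=self.publications | other.publications,
        )


# Invented sample data: one record from a staff list, one from a
# bibliography service.
from_staff_list = Academic(name="Jane Smith", position="Lecturer",
                           email="j.smith@example.org")
from_bibliography = Academic(name="Jane Smith",
                             publications={"Paper A", "Paper B"})
merged = from_staff_list.merge(from_bibliography)
```

Unioning the publication sets reflects the stated aim of building a list more complete than any single repository provides.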
Simply discovering who works for a department is more complex than generic
Named Entity Recognition (NER), as many irrelevant people's names are
mentioned in a site, e.g. names of undergraduate students and secretaries, as
well as names of researchers from external sites, all of which are irrelevant
for this task.
Armadillo uses a statistical evidence-based looping approach to aid the
validation task. Initially a quick list of potential names of people working in the