To add a large corpus or document collection (such as the BNC) to Tesla, create a new data source as described below. This tutorial assumes that the data source is a zip file containing several documents, so no custom document provider has to be implemented. If you want to implement a document provider for a different type of data source, have a look at the interface de.uni_koeln.spinfo.tesla.datasource.IDocumentProvider, its subinterfaces, and its implementations.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!-- A system-wide unique id of the data source must be defined here -->
<ns2:corpus_configuration id="BNC" xmlns:ns2="http://spinfo.uni-koeln.de/tesla">
    <!-- An instance of this class will provide access to the documents and their contents -->
    <providerClass>de.uni_koeln.spinfo.tesla.datasource.zip.ZipDocumentProvider</providerClass>
    <!-- This name will be displayed in the Corpus Manager View -->
    <displayName>British National Corpus</displayName>
    <!-- Short description of the data source -->
    <description>The British National Corpus in XML</description>
    <!-- The reader which will extract text and annotations from the documents -->
    <readerClass>de.uni_koeln.spinfo.tesla.component.reader.BNCReader</readerClass>
    <!-- Encoding of the stored documents -->
    <encoding>UTF-8</encoding>
    <!-- Whether the files should be indexed (made searchable) or not. -->
    <!-- In case of dynamic data sources, which are modified outside of Tesla, -->
    <!-- this flag should be turned off (set to false) -->
    <indexed>true</indexed>
    <!-- The following configuration options depend on the chosen document provider -->
    <configurations>
        <entry>
            <key>path</key>
            <value>bnc.zip</value>
        </entry>
        <entry>
            <key>suffix</key>
            <value>.xml</value>
        </entry>
    </configurations>
    <!-- Dublin core metadata which will be added to each document -->
    <metaData>
        <format>XML (TEI, Custom, UTF-8)</format>
        <source>British National Corpus, XML Edition</source>
        <language>English</language>
        <rights>BNC User Licence http://www.natcorp.ox.ac.uk/docs/licence.html</rights>
        <publisher>University of Oxford</publisher>
    </metaData>
</ns2:corpus_configuration>
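Before handing a configuration like the one above to Tesla, it can be useful to check that the file is well-formed and contains the settings you expect. The following is a minimal sketch of such a check, not part of Tesla and not how Tesla itself loads the file: it uses only the DOM parser shipped with the JDK. The class name CorpusConfigCheck, the "config." prefix for provider options, and the choice of which elements to treat as required are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical helper class; it is NOT part of the Tesla code base.
class CorpusConfigCheck {

    // Shortened copy of the configuration shown above, used as sample input.
    static final String SAMPLE =
        "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?>"
        + "<ns2:corpus_configuration id=\"BNC\" xmlns:ns2=\"http://spinfo.uni-koeln.de/tesla\">"
        + "<providerClass>de.uni_koeln.spinfo.tesla.datasource.zip.ZipDocumentProvider</providerClass>"
        + "<readerClass>de.uni_koeln.spinfo.tesla.component.reader.BNCReader</readerClass>"
        + "<encoding>UTF-8</encoding>"
        + "<indexed>true</indexed>"
        + "<configurations>"
        + "<entry><key>path</key><value>bnc.zip</value></entry>"
        + "<entry><key>suffix</key><value>.xml</value></entry>"
        + "</configurations>"
        + "</ns2:corpus_configuration>";

    // Parses the XML and collects the id, a few top-level settings
    // (treated as required here by assumption) and all key/value entries.
    static Map<String, String> summarize(String xml) {
        Document doc;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Configuration is not well-formed XML", e);
        }
        Element root = doc.getDocumentElement();

        Map<String, String> summary = new LinkedHashMap<>();
        summary.put("id", root.getAttribute("id"));

        for (String name : new String[] {"providerClass", "readerClass", "encoding", "indexed"}) {
            NodeList nodes = root.getElementsByTagName(name);
            if (nodes.getLength() == 0) {
                throw new IllegalArgumentException("Missing element: " + name);
            }
            summary.put(name, nodes.item(0).getTextContent().trim());
        }

        // Provider-specific options are stored as <entry><key/><value/></entry> pairs.
        NodeList entries = root.getElementsByTagName("entry");
        for (int i = 0; i < entries.getLength(); i++) {
            Element entry = (Element) entries.item(i);
            String key = entry.getElementsByTagName("key").item(0).getTextContent().trim();
            String value = entry.getElementsByTagName("value").item(0).getTextContent().trim();
            summary.put("config." + key, value);
        }
        return summary;
    }

    public static void main(String[] args) {
        // Print every collected setting as "name = value".
        summarize(SAMPLE).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

To check a real file, read its contents into a string and pass it to summarize. A malformed file or a missing element then surfaces as an IllegalArgumentException rather than as a problem noticed later inside Tesla.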