How to create a custom data source

To add large corpora or document collections (such as the BNC) to Tesla, create a new data source as described below. This tutorial assumes that the documents are packaged in a single zip file, so that no custom document provider has to be implemented. If you want to implement a document provider for a different type of data source, have a look at the interface de.uni_koeln.spinfo.tesla.datasource.IDocumentProvider, its subinterfaces, and its implementations.

Step 1
Create a new XML file in Tesla's data source directory (INSTALL_DIR/configuration/datasources).
Step 2
Insert the following XML content:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
    <!-- A system-wide unique id of the data source must be defined here -->
    <ns2:corpus_configuration id="BNC" xmlns:ns2="">
    <!-- An instance of this class will provide access to the documents and their contents -->
    <!-- This name will be displayed in the Corpus Manager View -->
    <displayName>British National Corpus</displayName>
    <!-- Short description of the data source -->
    <description>The British National Corpus in XML</description>
    <!-- The reader which will extract text and annotations from the documents -->
    <!-- Encoding of the stored documents -->
    <!-- Whether the files should be indexed (made searchable) or not. -->
    <!-- In case of dynamic data sources, which are modified outside of Tesla, -->
    <!-- this flag should be turned off (set to false) -->
    <!-- The following configuration options depend on the chosen document provider -->
    <!-- Dublin core metadata which will be added to each document -->
            <format>XML (TEI, Custom, UTF-8)</format>
            <source>British National Corpus, XML Edition</source>
            <rights>BNC User Licence</rights>
            <publisher>University of Oxford</publisher>
Step 3
Modify the metadata as needed: choose a unique id and displayName, write a concise description, and choose appropriate Dublin Core tags for general information in the metaData element.
Step 4
Modify the access elements: enter the fully qualified class names of the document provider and the default reader, and define the encoding of the documents. Depending on the chosen document provider, you also have to define how the data is accessed; for the ZipDocumentProvider, only the path to the zip file is required. This path can be either absolute or relative to the directory INSTALL_DIR/datasources. You can also define a filter based on file name suffixes, so that only files of a given type are analyzed.
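The path and filter rules in this step can be illustrated with a small Java sketch. It is not Tesla's implementation: the class ZipSourceSketch, its method names, and the installation directory are hypothetical. The filter semantics (case-sensitive matching, an empty list meaning "no filter") are assumptions made for the example.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class ZipSourceSketch {

    // Absolute paths are used as given; relative paths are resolved
    // against INSTALL_DIR/datasources, as described in step 4.
    static Path resolveZipPath(Path installDir, String configuredPath) {
        Path path = Paths.get(configuredPath);
        if (path.isAbsolute()) {
            return path.normalize();
        }
        return installDir.resolve("datasources").resolve(path).normalize();
    }

    // Assumption: an empty suffix list means "no filter", and matching
    // is case-sensitive. Tesla's actual filter may behave differently.
    static boolean accepts(String fileName, List<String> suffixes) {
        if (suffixes.isEmpty()) {
            return true;
        }
        for (String suffix : suffixes) {
            if (fileName.endsWith(suffix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical installation directory, for illustration only.
        Path installDir = Paths.get("tesla").toAbsolutePath();
        System.out.println(resolveZipPath(installDir, "bnc/bnc.zip"));
        System.out.println(accepts("A/A0/A00.xml", List.of(".xml"))); // true
        System.out.println(accepts("README.txt", List.of(".xml")));   // false
    }
}
```

With a suffix filter of ".xml", only the XML documents inside the zip file would be passed on to the reader, while auxiliary files such as a README would be skipped.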
Step 5
Restart the server. The new data source will be analyzed and, if indexing is enabled, indexed. It will then show up as a new corpus in Tesla's corpus manager.
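A hand-edited configuration file can easily end up with an unclosed tag, so it may be worth checking that the file is well-formed before restarting. The following sketch uses only the standard JDK XML parser. The class ConfigCheck is hypothetical and not part of Tesla, and the check covers well-formedness only, not whether Tesla accepts the individual elements.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Optional;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class ConfigCheck {

    // Returns the root element's id attribute, or an empty Optional
    // if the input is not well-formed XML.
    static Optional<String> rootId(byte[] xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml));
            return Optional.of(doc.getDocumentElement().getAttribute("id"));
        } catch (Exception e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) throws Exception {
        // Without an argument, a minimal inline sample is checked instead.
        byte[] xml = args.length > 0
                ? Files.readAllBytes(Paths.get(args[0]))
                : "<corpus_configuration id=\"BNC\"/>".getBytes(StandardCharsets.UTF_8);
        System.out.println(rootId(xml)
                .map(id -> "well-formed, id = " + id)
                .orElse("not well-formed"));
    }
}
```

With Java 11 or later, the file can be run directly by passing the path of the new configuration file as the only argument, for example: java ConfigCheck.java path/to/your-datasource.xml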