[AISWorld] [IJCAI-2019] Call for participation: FinSBD-2019 Shared Task - Sentence Boundary Detection in PDF Noisy Text in the Financial Domain

Fri Mar 8 05:57:58 EST 2019

Greetings,

We would like to invite you to submit to the shared task on *Sentence
Boundary Detection in PDF Noisy Text in the Financial Domain*, held in
conjunction with IJCAI-2019 on August 10-12, 2019, in Macao, China!

Call for Participation: http://finnlp.nlpfin.com

Register:
https://sites.google.com/nlg.csie.ntu.edu.tw/finnlp/shared-task-finsbd

Submission deadline: May 13, 2019
Workshop date: IJCAI-19, August 10-16, 2019, Macao, China
<https://www.ijcai19.org>

*------Motivation------*

Sentences are basic units of written language, and detecting the
beginning and end of sentences, or sentence boundary detection (SBD), is a
foundational first step in many Natural Language Processing (NLP)
applications, such as POS tagging; syntactic, semantic, and discourse
parsing; information extraction; or machine translation.

Despite its important role in NLP, sentence boundary detection has so far
not received enough attention. Previous research in the area has been
confined to formal texts (news, European Parliament proceedings, etc.),
where existing rule-based and machine learning approaches are extremely
accurate when the data is perfectly clean. No sentence boundary detection
research to date has addressed the problem in noisy texts extracted
automatically from machine-readable files (generally in PDF format), such
as financial documents.

In this shared task, we focus on extracting well-segmented sentences from
financial prospectuses by detecting their beginning and ending boundaries.
These are official PDF documents in which investment funds precisely
describe their characteristics and investment modalities. The most
important step in extracting any information from these files is to parse
them into noisy unstructured text, clean it, format the information (by
adding several tags) and, finally, transform it into semi-structured text
in which sentence boundaries are well marked.

*------Task Design------*

As part of the *FinNLP* workshop, we present a shared task on sentence
boundary detection in noisy text extracted from financial prospectuses, in
two languages: English and French.

Systems participating in this shared task will be given a set of textual
documents extracted from PDF files, which are to be automatically segmented
to extract a set of well-delimited sentences (clean sentences).

Participants can choose to work on both languages, or submit systems for
one language only.

In addition to the textual version of the documents, we will provide the
original PDF files. The organizers will also list or provide recommended
additional language resources for some languages.

*------Data Format:------*

In the provided dataset, participants will get a JSON file in which "text"
holds the text to be segmented, and begin_sentence and end_sentence list
the indexes of the tokens marking the beginning and the end of each
well-formed sentence in the text. Note that the provided text has already
been word-tokenized using NLTK. Participants should keep this tokenization
as is, since all token indexes are based on it. The first token in the
text has index 0.

[{

"text": " UFF Sélection Alpha AINFORMATIONS CLÉS POUR L' INVESTISSEUR  «
Ce document fournit des informations essentielles aux investisseurs de cet
OPCVM . Il ne s'agit pas d' un document promotionnel . Les informations qu
' il contient vous sont fournies conformément à une obligation l  égale ,
afin de vous aider à comprendre en quoi consiste un investissement dans ce
fonds et quels risques y sont associés . ...",

"begin_sentence": [8, 21, 31, ...],

"end_sentence": [20, 30, 66, ...]

}]
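
As an illustration of how the token indexes map back to sentences, here is
a minimal Python sketch. It assumes that tokens can be recovered by
splitting "text" on whitespace and that end_sentence indexes are inclusive;
both are assumptions to check against the released data. The function name
extract_sentences and the toy English record are hypothetical and are not
part of the official material.

```python
import json


def extract_sentences(record):
    """Return the sentences of one record as space-joined token strings.

    Assumption: tokens are separated by whitespace in record["text"],
    and each end index points at the last token of its sentence.
    """
    tokens = record["text"].split()
    sentences = []
    for begin, end in zip(record["begin_sentence"], record["end_sentence"]):
        # end is inclusive, so slice up to end + 1
        sentences.append(" ".join(tokens[begin:end + 1]))
    return sentences


# Toy record in the same layout as the example above (made-up content).
sample = json.loads(
    '[{"text": "Fund A KEY INFORMATION This document is important . '
    'It is not marketing .", '
    '"begin_sentence": [4, 9], "end_sentence": [8, 13]}]'
)

for sentence in extract_sentences(sample[0]):
    print(sentence)
```

Tokens outside every begin/end pair (here the header "Fund A KEY
INFORMATION") are simply not part of any sentence.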

All input text will be preprocessed in a common way, so that every
participant has access to the same preprocessed text at no additional
cost. Rule-based, machine learning, deep learning, or hybrid techniques
are all allowed.

Participants will get annotated training/dev data and, later, blind test
data in the same JSON format but containing only the text. They should
then predict the lists begin_sentence and end_sentence and submit the
result in the same JSON format as the training data.
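
A submission could be assembled along the following lines. This is only a
sketch: the punctuation-based predictor is a deliberately naive placeholder
(not a method suggested by the organizers), tokens are again assumed to be
whitespace-separated, and the names predict_boundaries and build_submission
are hypothetical.

```python
import json

# Assumption: these tokens close a sentence. Real prospectus text is
# noisier, which is exactly what the shared task is about.
END_TOKENS = {".", "!", "?"}


def predict_boundaries(text):
    """Naive baseline: a sentence ends at each token found in END_TOKENS."""
    tokens = text.split()
    begins, ends = [], []
    start = 0
    for index, token in enumerate(tokens):
        if token in END_TOKENS:
            begins.append(start)
            ends.append(index)
            start = index + 1
    return begins, ends


def build_submission(test_records):
    """Add predicted begin_sentence/end_sentence lists to each record."""
    submission = []
    for record in test_records:
        begins, ends = predict_boundaries(record["text"])
        submission.append({
            "text": record["text"],
            "begin_sentence": begins,
            "end_sentence": ends,
        })
    return submission


# Toy blind-test record (made-up content) with only the "text" field.
test_records = [{"text": "Fees are low . Risk is high ."}]
submission = build_submission(test_records)

# ensure_ascii=False keeps accented characters readable for French text.
payload = json.dumps(submission, ensure_ascii=False)
print(payload)
```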

*------Important dates------*

February 28, 2019: First announcement of the shared task and beginning of
registration
March 7, 2019: Release of training data and scoring script
April 29, 2019: Registration deadline
May 6, 2019: Test set made available
May 13, 2019: Systems' outputs collected
May 27, 2019: Shared task system paper submissions due
June 17, 2019: Notification of acceptance
June 24, 2019: Camera-ready version of shared task system papers due
August 10-12, 2019: FinNLP 2019 Workshop in Macau

*Read more:*

FinNLP: http://finnlp.nlpfin.com
FinSBD:
https://sites.google.com/nlg.csie.ntu.edu.tw/finnlp/shared-task-finsbd
IJCAI-19: https://ijcai19.org/

Sincerely,

The FinSBD Organizers

IJCAI-19