ENH+FIX+WIP fragmented EDUs, nuc in SimpleRSTTree, Parseval, one file per doc #111

Open
wants to merge 45 commits into
base: master
615e6ea
MAINT educe.rst_dt.similarity flake8
moreymat Jan 27, 2017
fee0a88
ENH move, refactor structured metrics (inc. Parseval) from attelo
moreymat Jan 27, 2017
43ebe63
DOC+MAINT educe.external improve docstrings, style
moreymat Jan 27, 2017
0637228
MAINT educe.pdtb flake8, pylint
moreymat Jan 27, 2017
bd2a3e7
MAINT educe.stac.{edit,oneoff} minor style
moreymat Jan 27, 2017
3a1e338
MAINT educe.stac.sanity pylint
moreymat Jan 27, 2017
f997600
MAINT educe.stac.util minor style
moreymat Jan 27, 2017
abeefef
MAINT rename local csv modules to {educe,stac}_csv_format
moreymat Jan 27, 2017
a361c6b
MAINT educe.stac more pylint
moreymat Jan 27, 2017
de70898
FIX+MAINT catch up with renamed module, pylint
moreymat Jan 27, 2017
f781c81
MAINT educe/*.py pylint, minor fixes for style
moreymat Jan 28, 2017
3736c3f
DOC+MAINT docstring, style
moreymat Jan 28, 2017
8ef1740
MAINT educe.rst_dt refactoring, same_unit ; stac inquirer
moreymat Jan 30, 2017
5f62f46
WIP educe.stac document-centric feature extraction
moreymat Jan 31, 2017
199b534
WIP educe.rst_dt document-centric feature extraction
moreymat Jan 31, 2017
aab0212
WIP document-centric feature extraction, contd.
moreymat Feb 1, 2017
0f7d678
WIP document-centric feature extraction, part 3
moreymat Feb 2, 2017
6b37d86
WIP rst_dt: fragmented EDUs
moreymat Feb 2, 2017
207bbd5
WIP disdep format
moreymat Feb 2, 2017
4ee8c03
WIP feature extraction runs on RST-DT, file_split=corpus
moreymat Feb 2, 2017
7cfc500
FIX from/to SimpleRSTTree: nuc moved up too
moreymat Feb 9, 2017
c689eb5
DOC fix a few docstrings
moreymat Feb 10, 2017
ca2f2ac
DOC warn about the bug-prone API of deptree_to_simple_rst_tree
moreymat Feb 10, 2017
f808be1
FIX dump in feature extraction for STAC
moreymat Feb 14, 2017
6a0cb77
FIX load and dump labels for STAC
moreymat Feb 14, 2017
e0bf3dc
FIX paths to data files under data/{corpus_name}/
moreymat Feb 17, 2017
71db308
Merge remote-tracking branch 'upstream/master' into enh-educe-metrics
moreymat Apr 11, 2017
b920785
FIX rename metrics to S, N, R, F
moreymat Apr 11, 2017
eef2bbc
ENH parseval_compact_report
moreymat Apr 11, 2017
b9d8a56
ENH ctree spans can be in chars
moreymat Apr 12, 2017
62cab3e
ENH display percentages
moreymat May 16, 2017
c7044d5
ENH rst_dt.annotation._binarize() param branching
moreymat May 17, 2017
695d40c
ENH compact_report: parser_true
moreymat May 17, 2017
6050252
DOC minor fix
moreymat May 17, 2017
fe4c0c2
ENH parseval similarity matrix
moreymat May 18, 2017
7e6b709
FIX educe.metrics.parseval missing newline
moreymat May 21, 2017
1752202
FIX pairwise sim report: no underscore
moreymat May 21, 2017
f462d47
MAINT backport compatible changes and fixes from master, eg. doc_glob
moreymat Jun 7, 2017
48f7c26
MAINT minor cleanup
moreymat Jun 7, 2017
6dfa47f
ENH backwards compat: option to load/dump labels from/to features file
moreymat Jun 9, 2017
b55fa25
MAINT minor changes in layout, docstring
moreymat Jun 9, 2017
80caf8f
MAINT rm unnecessary imports
moreymat Jun 9, 2017
a855be0
FIX load_labels: set default to 'file'
moreymat Jun 9, 2017
fffbbff
MAINT minor refactoring, cleanups
moreymat Jun 13, 2017
8c2bd10
MAINT correct minor divergences to get closer to enh-dump-formats
moreymat Jun 30, 2017
174 changes: 127 additions & 47 deletions educe/annotation.py
@@ -200,43 +200,73 @@ def __repr__(self):

# pylint: disable=no-self-use
class Standoff(object):
"""
A standoff object ultimately points to some piece of text.
The pointing is not necessarily direct though
"""A standoff object ultimately points to some piece of text.

The pointing is not necessarily direct though.

Attributes
----------
origin : educe.corpus.FileId, optional
FileId of the document supporting this standoff.
"""
def __init__(self, origin=None):
self.origin = origin

def _members(self):
"""
Any annotations contained within this annotation.
"""Any annotations contained within this annotation.

Must return None if this is a terminal annotation (not the same
meaning as returning the empty list)
meaning as returning the empty list).
Non-terminal annotations must override this.

Returns
-------
res : list of Standoff or None
Annotations contained within this annotation; None for
terminal annotations.
"""
return None

def _terminals(self, seen=None):
"""
"""Terminal annotations contained within this annotation.

For terminal annotations, this is just the annotation itself.
For non-terminal annotations, this recursively fetches the
terminals
terminals.

Parameters
----------
seen : optional
List of annotations that have already been seen, so as to
avoid returning duplicates.

Returns
-------
res : list of Standoff
List of terminal annotations for this annotation.
"""
my_members = self._members()
seen = seen or []
if my_members is None:
return [self]
else:
return chain.from_iterable([m._terminals(seen + my_members)
for m in my_members if m not in seen])
seen = seen or []
return chain.from_iterable([m._terminals(seen=seen + my_members)
for m in my_members if m not in seen])
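
Note on the `_members` / `_terminals` contract documented above: the following is a minimal sketch, not educe code. `Node` is a hypothetical stand-in for `Standoff` that copies the new `_terminals` body from this hunk, so the difference between `None` (terminal) and an empty list (empty non-terminal), and the effect of `seen`, can be checked in isolation.

```python
from itertools import chain


class Node(object):
    """Hypothetical stand-in for Standoff, for illustration only."""

    def __init__(self, name, members=None):
        self.name = name
        # None marks a terminal; a list (even an empty one) marks a
        # non-terminal
        self.members = members

    def _members(self):
        return self.members

    def _terminals(self, seen=None):
        # same logic as the new version of Standoff._terminals
        my_members = self._members()
        if my_members is None:
            return [self]
        seen = seen or []
        return chain.from_iterable(
            [m._terminals(seen=seen + my_members)
             for m in my_members if m not in seen])


a, b = Node('a'), Node('b')
inner = Node('inner', [a, b])
# b is reachable twice: directly from outer, and through inner
outer = Node('outer', [inner, b])
empty = Node('empty', [])

terminal_names = [t.name for t in outer._terminals()]
print(terminal_names)
print(list(empty._terminals()))
```

In this toy setup `b` is reported once, because it is already in `seen` when `inner` is visited; the empty non-terminal yields no terminals at all, which is the case where `text_span` returns None.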

def text_span(self):
"""
Return the span from the earliest terminal annotation contained here
to the latest.

Corner case: if this is an empty non-terminal (which would be a very
weird thing indeed), return None
weird thing indeed), return None.

Returns
-------
res : Span or None
Span from the first character of the earliest terminal
annotation contained here, to the last character of the
latest terminal annotation; None if this annotation has no
terminal.
"""
terminals = list(self._terminals())
if len(terminals) > 0:
@@ -248,34 +278,73 @@ def text_span(self):

def encloses(self, other):
"""
True if this annotations's span encloses the span of the other.
True if this annotation's span encloses the span of the other.

`s1.encloses(s2)` is shorthand for
`s1.text_span().encloses(s2.text_span())`

Parameters
----------
other : Standoff
Other annotation.

Returns
-------
res : boolean
True if this annotation's span encloses the span of the
other.
"""
return self.text_span().encloses(other.text_span())

def overlaps(self, other):
"""
True if this annotations's span encloses the span of the other.
True if this annotation's span overlaps with the span of the other.

`s1.overlaps(s2)` is shorthand for
`s1.text_span().overlaps(s2.text_span())`

Parameters
----------
other : Standoff
Other annotation.

Returns
-------
res : boolean
True if this annotation's span overlaps with the span of the
other.
"""
return self.text_span().overlaps(other.text_span())
# pylint: enable=no-self-use
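
Note on the `encloses` / `overlaps` shorthands documented above: a minimal sketch of the delegation pattern, under stated assumptions. `ToySpan` and `ToyUnit` are hypothetical classes, not the educe `Span` API; in particular, the half-open interval convention and the boolean return value of `ToySpan.overlaps` are assumptions made for the example.

```python
class ToySpan(object):
    """Hypothetical character span [start, end), illustration only."""

    def __init__(self, start, end):
        self.start = start
        self.end = end

    def encloses(self, other):
        # other lies entirely within self
        return self.start <= other.start and self.end >= other.end

    def overlaps(self, other):
        # the two spans share at least one character
        return max(self.start, other.start) < min(self.end, other.end)


class ToyUnit(object):
    """Hypothetical annotation that delegates to its span."""

    def __init__(self, span):
        self._span = span

    def text_span(self):
        return self._span

    def encloses(self, other):
        return self.text_span().encloses(other.text_span())

    def overlaps(self, other):
        return self.text_span().overlaps(other.text_span())


sentence = ToyUnit(ToySpan(0, 20))
word = ToyUnit(ToySpan(5, 9))
straddler = ToyUnit(ToySpan(15, 25))

print(sentence.encloses(word))
print(sentence.overlaps(straddler), sentence.encloses(straddler))
```

Since `text_span` is documented as returning None for an empty non-terminal, the shorthands as written in the diff would presumably fail with an AttributeError in that corner case; callers may want to guard against it.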


class Annotation(Standoff):
"""
Any sort of annotation. Annotations tend to have
"""Any sort of annotation.

Annotations tend to have:
* span: some sort of location (what they are annotating)
* type: some key label (we call a type)
* features: an attribute to value dictionary
"""
def __init__(self, anno_id, span, atype, features,
metadata=None, origin=None):
def __init__(self, anno_id, span, atype, features, metadata=None,
origin=None):
"""Init method.

Parameters
----------
anno_id : TODO
Identifier for this annotation.
span : Span
Coordinates of the annotated span.
atype : str
Annotation type.
features : dict from str to str
Features as a dict from feature_name to feature_value.
metadata : dict from str to str, optional
Metadata for the annotation, e.g. author, creation date...
origin : FileId, optional
FileId of the document that supports this annotation.
"""
Standoff.__init__(self, origin)
self.origin = origin
self._anno_id = anno_id
@@ -293,14 +362,16 @@ def __str__(self):
(self.identifier(), self.type, self.span, feats))

def local_id(self):
"""
An identifier which is sufficient to pick out this annotation within a
single annotation file
"""Local identifier.

An identifier which is sufficient to pick out this annotation
within a single annotation file.
"""
return self._anno_id

def identifier(self):
"""
"""Global identifier if possible, else local identifier.

String representation of an identifier that should be unique
to this corpus at least.

@@ -313,7 +384,7 @@ def identifier(self):
* and the id from the XML file

If we don't have an origin we fall back to just the id provided
by the XML file
by the XML file.

See also `position` as potentially a safer alternative to this
(and what we mean by safer)
@@ -326,11 +397,14 @@ def identifier(self):


class Unit(Annotation):
"""Unit annotation.

An annotation over a span of text.

"""
An annotation over a span of text
"""
def __init__(self, unit_id, span, utype, features,
metadata=None, origin=None):

def __init__(self, unit_id, span, utype, features, metadata=None,
origin=None):
Annotation.__init__(self, unit_id, span, utype, features,
metadata, origin)

@@ -351,13 +425,15 @@ def position(self):

**position vs identifier**

This is a trade-off. One the hand, you can see the position as being
a safer way to identify a unit, because it obviates having to worry
about your naming mechanism guaranteeing stability across the board
(eg. two annotators stick an annotation in the same place; does it have
the same name). On the *other* hand, it's a bit harder to uniquely
identify objects that may coincidentally fall in the same span. So
how much do you trust your IDs?
This is a trade-off.
On the one hand, you can see the position as being a safer way
to identify a unit, because it obviates having to worry about
your naming mechanism guaranteeing stability across the board
(e.g. two annotators stick an annotation in the same place; does
it have the same name?).
On the *other* hand, it's a bit harder to uniquely identify
objects that may coincidentally fall in the same span.
So how much do you trust your IDs?
"""
if self.origin is None:
ostuff = []
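
Note on the position vs identifier trade-off described in the docstring above: a minimal sketch, assuming a simplified unit with a string origin. `ToyUnit`, `identifier` and `position` here are hypothetical helpers written for this note; the exact strings produced by educe may differ.

```python
from collections import namedtuple

# Hypothetical flat record; the real Unit carries a FileId and a Span
ToyUnit = namedtuple('ToyUnit', ['origin', 'local_id', 'start', 'end'])


def identifier(unit):
    """Name-based key: origin plus the id from the annotation file."""
    if unit.origin is None:
        return unit.local_id
    return unit.origin + '_' + unit.local_id


def position(unit):
    """Location-based key: origin plus the character offsets."""
    parts = [] if unit.origin is None else [unit.origin]
    return ':'.join(parts + [str(unit.start), str(unit.end)])


# Two annotators mark the same stretch of text under different ids
by_alice = ToyUnit('doc1', 'alice_1', 0, 12)
by_bob = ToyUnit('doc1', 'bob_7', 0, 12)

print(identifier(by_alice), identifier(by_bob))
print(position(by_alice), position(by_bob))
```

The identifiers differ while the positions coincide. That is both the benefit (stable across naming schemes) and the cost (two distinct objects on the same span cannot be told apart) that the docstring describes.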
@@ -379,20 +455,24 @@ class Relation(Annotation):
`fleshout` is called (corpus slurping normally fleshes out
documents and thus their relations).

Parameters
----------
rel_id : string
Relation id
span : RelSpan
Pair of units connected by this relation
rtype : string
Relation type
features : dict
Features
metadata : TODO
TODO
"""

def __init__(self, rel_id, span, rtype, features, metadata=None):
"""Init method.

Parameters
----------
rel_id : string
Relation id
span : RelSpan
Pair of units connected by this relation
rtype : string
Relation type
features : dict
Features
metadata : dict from str to str, optional
Metadata for this annotation.
"""
Annotation.__init__(self, rel_id, span, rtype, features, metadata)
self.source = None # to be defined in fleshout
'source annotation; will be defined by fleshout'