ExtractDocumentLengths: prints out sum of doclengths, both lossy and lossless #1040

Merged (4 commits) on Mar 20, 2020


Member

@lintool lintool commented Mar 20, 2020

Ref: osirrc/ciff#21

Adds check in ExtractDocumentLengths per above issue.
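
As a rough illustration of why printing both sums is a useful check, the sketch below compares a lossless sum of document lengths with a sum over lengths that have passed through a lossy encoding. It assumes "lossy" means a compact representation that discards low-order bits. The `quantize` method and the `DocLengthSums` class are made up for this example and are not the encoding the index actually uses.

```java
class DocLengthSums {
  // Hypothetical lossy quantizer (an assumption, NOT the index's real encoding):
  // keeps only the 4 most significant bits of a non-negative length.
  static int quantize(int length) {
    if (length < 16) {
      return length; // small values fit in 4 bits and are kept exactly
    }
    int significantBits = 32 - Integer.numberOfLeadingZeros(length);
    int shift = significantBits - 4;
    return (length >>> shift) << shift; // zero out the low-order bits
  }

  // Sum of the exact document lengths.
  static long losslessSum(int[] lengths) {
    long sum = 0L;
    for (int length : lengths) {
      sum += length;
    }
    return sum;
  }

  // Sum of the document lengths after the lossy round trip.
  static long lossySum(int[] lengths) {
    long sum = 0L;
    for (int length : lengths) {
      sum += quantize(length);
    }
    return sum;
  }

  public static void main(String[] args) {
    int[] lengths = {3, 100, 1000};
    System.out.println("lossless sum: " + losslessSum(lengths));
    System.out.println("lossy sum:    " + lossySum(lengths));
  }
}
```

Comparing the two totals gives a quick check that two tools read the same index the same way, since a mismatch in either sum points at a difference in how lengths were stored or decoded.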

codecov bot commented Mar 20, 2020

Codecov Report

Merging #1040 into master will increase coverage by 0.34%.
The diff coverage is 100.00%.

@@             Coverage Diff              @@
##             master    #1040      +/-   ##
============================================
+ Coverage     43.18%   43.53%   +0.34%     
- Complexity      610      622      +12     
============================================
  Files           128      128              
  Lines          7750     7759       +9     
  Branches       1131     1131              
============================================
+ Hits           3347     3378      +31     
+ Misses         4082     4062      -20     
+ Partials        321      319       -2     
| Impacted Files | Coverage Δ | Complexity Δ |
|---|---|---|
| .../java/io/anserini/util/ExtractDocumentLengths.java | 86.84% <100.00%> (+10.98%) | 3.00 <0.00> (+1.00) |
| ...anserini/ltr/feature/base/PMIFeatureExtractor.java | 86.53% <0.00%> (+1.92%) | 13.00% <0.00%> (+1.00%) |
| ...java/io/anserini/ltr/feature/CountBigramPairs.java | 89.61% <0.00%> (+24.67%) | 33.00% <0.00%> (+10.00%) |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 1b07219...a5bd731.

Collaborator

@chriskamphuis chriskamphuis left a comment

The code looks good, but it does not work for core18.
core18 contains documents of length 0. For such documents the index stores no document vector, so the following line returns null, which then causes an error:

Terms terms = reader.getTermVector(i, "contents");
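
One way to handle the case described above is to treat a missing term vector as a document of length 0. The sketch below shows the idea in a self-contained form and is not necessarily the fix that was merged. The `IntFunction` lookup is a hypothetical stand-in for `reader.getTermVector(i, "contents")`, and the `DocLengthGuard` class and its term-to-frequency map are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntFunction;

class DocLengthGuard {
  // termVectors is a hypothetical stand-in for reader.getTermVector(i, "contents"):
  // it returns a term -> frequency map, or null when no vector is stored for the doc.
  static long docLength(IntFunction<Map<String, Long>> termVectors, int docId) {
    Map<String, Long> terms = termVectors.apply(docId);
    if (terms == null) {
      // No stored vector: treat the document as empty instead of dereferencing null.
      return 0L;
    }
    long length = 0L;
    for (long freq : terms.values()) {
      length += freq; // document length = sum of term frequencies
    }
    return length;
  }

  public static void main(String[] args) {
    Map<Integer, Map<String, Long>> index = new HashMap<>();
    Map<String, Long> doc0 = new HashMap<>();
    doc0.put("anserini", 2L);
    doc0.put("lucene", 1L);
    index.put(0, doc0);
    // Doc 1 has no entry, which simulates an empty document without a term vector.
    for (int docId = 0; docId < 2; docId++) {
      System.out.println(docId + "\t" + docLength(index::get, docId));
    }
  }
}
```

Returning 0 keeps empty documents in the output, so the per-document listing still has one row per document and the printed sums stay comparable.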

@chriskamphuis
Collaborator

(I suppose this problem already existed)

@lintool
Member Author

lintool commented Mar 20, 2020

Fixed the issue you mentioned while I was at it...

@lintool lintool merged commit deae4b1 into master Mar 20, 2020
@lintool lintool deleted the doclength branch March 20, 2020 11:29
crystina-z pushed a commit to crystina-z/anserini that referenced this pull request Oct 28, 2022