Basic implementation of a RefSeq type provider, alongside the GenBank type provider. The RefSeq provider mostly works in the same way as the GenBank provider, except that it uses a different set of files.

DataFileGenerator.fsx:
- Added the ability to create "refseq" data files based on the list of RefSeq assemblies on the NCBI FTP server.
- Refactoring party to support the above.

Common.fs
- Altered the CacheHelpers module to support RefSeq. Instead of keeping the RefSeq submodule I initially made alongside a GenBank and a General submodule, I folded all the functions into the CacheHelpers module itself, since the implementation is exactly the same and only the files to retrieve change.
- The DatabaseName type now has a custom ToString() method that returns the name of the database; this is used when showing messages and creating filenames (where it is converted to lower case).
- Error messages for failing to find species and accessions now specify that they must be valid for the database in use.
- RefSeq paths in CacheAccess are now supported, rather than failing with an unsupported message.
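
As an illustration of the ToString() change described above, here is a minimal sketch. Common.fs is not shown in this diff, so the exact definition is an assumption; `messageFor` and `speciesFilename` are hypothetical helpers added only for this example.

```fsharp
// Hypothetical sketch: the real DatabaseName type lives in Common.fs,
// which is not part of the diff shown here.
type DatabaseName =
    | GenBank
    | RefSeq

    // Returns the display name of the database.
    override this.ToString() =
        match this with
        | GenBank -> "GenBank"
        | RefSeq -> "RefSeq"

// Hypothetical helper: uses the name as-is in a user-facing message.
let messageFor (database : DatabaseName) =
    sprintf "Retrieving %s data from the NCBI FTP server." (database.ToString())

// Hypothetical helper: lower-cases the name when building a data filename,
// matching the naming of the files in .\build\data.
let speciesFilename (database : DatabaseName) (character : char) =
    sprintf "%s-species-%c.txt.gz" (database.ToString().ToLower()) character
```

With this shape, one type serves both purposes: the provider can print "RefSeq" in help text and still resolve a file such as refseq-species-a.txt.gz.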

DesignTime.fs
- Added a RefSeq type provider which is a copy of the GenBank type provider, though with GenBank references changed to RefSeq.
- Assembly help text specifies data comes from the "NCBI FTP server" now, as well as whether the data being retrieved is GenBank or RefSeq.

RunTime.fsproj
- Removed the target I added for testing, which automatically removed existing BioProviders packages from the NuGet cache.

The new "refseq" data files are also included in the repository now in .\build\data, along with the "genbank" data files being updated to what was on the NCBI server on 15-10-2023.

Signed-off-by: n7581769 <st2.smith@hdr.qut.edu.au>
n7581769 committed Oct 19, 2023
1 parent bb24cb4 commit f613edc
Showing 112 changed files with 388 additions and 284 deletions.
150 changes: 100 additions & 50 deletions DataFileGenerator.fsx
@@ -10,7 +10,7 @@ open FluentFTP

// ------ Record types used for reading and writing files ------
// Rows for the original GenBank TSV file.
type GenBankRow = {
type FileRow = {
assembly_accession : string
bioproject : string
biosample : string
@@ -64,13 +64,42 @@ type SpeciesRow = {
species_name : string
}

/// Typed representation of an NCBI database. NCBI contains two main genome
/// databases: GenBank and RefSeq.
type DatabaseName =
    | GenBank
    | RefSeq

    // Returns the base path of the files of each database. Used to remove the
    // necessary characters from the URLs in the original assembly list when
    // creating the new lists.
    member this.GetBasePath() =
        match this with
        | GenBank -> "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/"
        | RefSeq -> "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/"

    // Returns the location of the assembly file on the FTP server for the
    // database. Does not include the host path.
    member this.GetAssemblyFilePath() =
        match this with
        | GenBank -> "/genomes/genbank/assembly_summary_genbank.txt"
        | RefSeq -> "/genomes/refseq/assembly_summary_refseq.txt"

    // Returns the name of the database as a string.
    member this.GetName() =
        match this with
        | GenBank -> "GenBank"
        | RefSeq -> "RefSeq"

    // Returns the filename of the assembly file.
    member this.GetFilename() =
        match this with
        | GenBank -> "assembly_summary_genbank.txt"
        | RefSeq -> "assembly_summary_refseq.txt"

// Character array.
let characters = Seq.concat [['#']; ['a' .. 'z']]

// Base URL for GenBank files on the FTP server. Used to delete the correct
// number of characters from the FTP path.
let genBankURL = "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/"

// ------ Functions for generating and writing data files ------
// Function for matching the first character of a species name.
// Characters that are not letters are treated as a '#'.
@@ -81,7 +110,7 @@ let getLookupCharacter (name: string) =

// Generate a list of distinct species with a unique species ID number for
// each, starting with the specified character.
let getSpeciesList (filteredList : GenBankRow list) (count : int) =
let getSpeciesList (filteredList : FileRow list) (count : int) =
// Get a distinct list of species names.
// Also sorts it into alphabetical order.
let distinctList = List.sort (List.distinct (List.map (fun row -> row.organism_name) filteredList))
@@ -90,20 +119,20 @@ let getSpeciesList (filteredList : GenBankRow list) (count : int) =

// Generate a list of assemblies belonging to the species of a specified
// character, with the correct ID number for their species.
let getAssemblyList (filteredList : GenBankRow list) (speciesList : SpeciesRow list) =
let getAssemblyList (database : DatabaseName) (filteredList : FileRow list) (speciesList : SpeciesRow list) =
// Function for finding a species name match for a certain row.
let findNameMatch row = List.tryFind (fun species -> species.species_name.Equals(row.organism_name)) speciesList
// Filter the CSV rows by those that have one of the organism names in the
// supplied list, and that have a FTP path that isn't "na".
let listWithPaths = List.filter (fun (row : GenBankRow) -> not (row.ftp_path.Equals("na"))) filteredList
let listWithPaths = List.filter (fun (row : FileRow) -> not (row.ftp_path.Equals("na"))) filteredList
// Function for sorting a list of AssemblyRows. It should be in the order
// of species IDs, and then the accessions if the IDs are the same.
let sortAssemblies (assembly1 : AssemblyRow) (assembly2: AssemblyRow) =
match assembly1.species_id.CompareTo(assembly2.species_id) with
| 0 -> assembly1.assembly_accession.CompareTo(assembly2.assembly_accession)
| result -> result
// Return a (sorted) list of AssemblyRows.
List.sortWith sortAssemblies (List.map (fun row -> { species_id = ((findNameMatch row).Value.species_id) ; assembly_accession = row.assembly_accession ; ftp_path = row.ftp_path.[(String.length genBankURL)..] } ) listWithPaths)
List.sortWith sortAssemblies (List.map (fun row -> { species_id = ((findNameMatch row).Value.species_id) ; assembly_accession = row.assembly_accession ; ftp_path = row.ftp_path.[(String.length (database.GetBasePath()))..] } ) listWithPaths)

// Compresses a written text file using GZip compression, writes it to a new
// file and deletes the original.
@@ -130,7 +159,7 @@ let internal useNCBIConnection (callback) =
// file.
// - If a file doesn't exist, or is older: return to overwrite existing
// file.
// - Otherwise: return to resume existing file (in case it wasn't
// - Otherwise: try to resume existing file (in case it wasn't
// downloaded fully before).
let isNewerFile (localPath: string) (remotePath: string) (connection: FtpClient) =
if (not (File.Exists(localPath))) then
@@ -154,65 +183,65 @@ let downloadNCBIFile (localPath: string, remotePath: string) =

useNCBIConnection downloadFile

let downloadedFilePath = (Path.Combine(Path.GetTempPath(), "BioProviders_Build", "downloaded_list.txt"))

// ------ Main operations ------

printfn "------------ Starting operations to generate GenBank data file lists for BioProviders. ------------"
// ------ Parsing operations ------

printfn "------ Downloading GenBank summary file to %s...... ------" downloadedFilePath
// Download the assembly file for the given database from the NCBI FTP
// server and parse it into a set of records.
let getFtpList (database : DatabaseName) =
let downloadedFilePath = (Path.Combine(Path.GetTempPath(), "BioProviders_Build", (database.GetFilename())))
printfn "Downloading %s summary file to %s..." (database.GetName()) downloadedFilePath

let status = downloadNCBIFile (downloadedFilePath, "/genomes/genbank/assembly_summary_genbank.txt")
let status = downloadNCBIFile (downloadedFilePath, (database.GetAssemblyFilePath()))

match status with
| FtpStatus.Failed -> failwith "------ Failed to download file from NCBI FTP server. ------"
| FtpStatus.Skipped -> printfn "------ File already downloaded. ------"
| _ -> printfn "------ File downloaded successfully. ------"
match status with
| FtpStatus.Failed -> failwith "Failed to download file from NCBI FTP server."
| FtpStatus.Skipped -> printfn "File already downloaded."
| _ -> printfn "File downloaded successfully."

printfn "------ Loading in GenBank assembly summary TSV... ------"
printfn "Loading in %s assembly summary TSV..." (database.GetName())

// Load in the GenBank file.
(*let reader = new StreamReader("D:\\Users\\Samuel Smith_3\\Documents\\RA\\Downloads\\GenBank FTP\\assembly_summary_genbank_25-09-2023.txt")*)
let reader = new StreamReader(downloadedFilePath)
// Load in the GenBank file.
(*let reader = new StreamReader("D:\\Users\\Samuel Smith_3\\Documents\\RA\\Downloads\\GenBank FTP\\assembly_summary_genbank_25-09-2023.txt")*)
let reader = new StreamReader(downloadedFilePath)

// A function to skip lines that start with ##, to ignore the comment.
let skipFunction (args : ShouldSkipRecordArgs) =
args.Row[0].StartsWith("##")
// A function to skip lines that start with ##, to ignore the comment.
let skipFunction (args : ShouldSkipRecordArgs) =
args.Row[0].StartsWith("##")

// Configuration for the CSV reader. It:
// - Chooses tab as the delimiter;
// - Sets the mode to no escape to ignore quotes;
// - Uses the above function to skip comment lines; and
// - Clear the # symbol on any headers.
let config = new CsvConfiguration(CultureInfo.InvariantCulture)
config.Delimiter <- "\t"
config.Mode <- CsvMode.NoEscape
config.ShouldSkipRecord <- new ShouldSkipRecord(skipFunction)
config.PrepareHeaderForMatch <- fun args -> args.Header.TrimStart('#')
// Configuration for the CSV reader. It:
// - Chooses tab as the delimiter;
// - Sets the mode to no escape to ignore quotes;
// - Uses the above function to skip comment lines; and
// - Clears the # symbol on any headers.
let config = new CsvConfiguration(CultureInfo.InvariantCulture)
config.Delimiter <- "\t"
config.Mode <- CsvMode.NoEscape
config.ShouldSkipRecord <- new ShouldSkipRecord(skipFunction)
config.PrepareHeaderForMatch <- fun args -> args.Header.TrimStart('#')

// Create a CSV reader object and get all records in the loaded file.
let csv = new CsvReader(reader, config)
let records = Seq.toList (csv.GetRecords<GenBankRow>())
// Create a CSV reader object and get all records in the loaded file.
let csv = new CsvReader(reader, config)
let records = Seq.toList (csv.GetRecords<FileRow>())

// Show how many records were loaded.
printfn "Loaded %i records." (List.length records)
printfn "------ TSV loaded successfully. ------"
// Show how many records were loaded.
printfn "%s TSV loaded successfully with a total of %i records." (database.GetName()) (List.length records)
records

// Generate a list of species and assemblies for the given character, and write
// them to a file. An integer accumulator is used to ensure unique numerical
// IDs for all distinct species.
let generateLists (fullList : GenBankRow list) (acc : int) (character : char) =
let generateLists (database : DatabaseName) (fullList : FileRow list) (acc : int) (character : char) =
// Filter the full list of assemblies for only those that have an organism
// name matching the current character.
let filteredList = List.filter (fun row -> (getLookupCharacter row.organism_name).Equals(character)) fullList

// Generate the lists of species and assemblies for the given character.
let speciesList = (getSpeciesList filteredList acc)
let assemblyList = (getAssemblyList filteredList speciesList)
let assemblyList = (getAssemblyList database filteredList speciesList)

// Generate the filenames for the species and assembly files.
let speciesFilename = $"./build/data/genbank-species-{character}.txt"
let assemblyFilename = $"./build/data/genbank-assemblies-{character}.txt"
let speciesFilename = $"./build/data/{(database.GetName().ToLower())}-species-{character}.txt"
let assemblyFilename = $"./build/data/{(database.GetName().ToLower())}-assemblies-{character}.txt"

// Write the species entries to a file.
let speciesWriter = new StreamWriter(speciesFilename)
@@ -234,6 +263,27 @@ let generateLists (fullList : GenBankRow list) (acc : int) (character : char) =
// correct number for the next character.
acc + List.length speciesList

printfn "------ Generating new lists from loaded GenBank assembly list... ------"
printfn "------ Successfully generated lists for %i species. ------" (Seq.fold (generateLists records) 0 characters)
// Handles the operations for GenBank.
let generateGenBankLists () =
    let database = GenBank
    printfn "------ Creating lists for %s ------" (database.GetName())
    let records = getFtpList database
    printfn "Generating new lists from loaded %s assembly list..." (database.GetName())
    printfn "Generated lists for %i species." (Seq.fold (generateLists database records) 0 characters)
    printfn "------ %s operations successful. ------" (database.GetName())

// Handles the operations for RefSeq.
let generateRefSeqLists () =
    let database = RefSeq
    printfn "------ Creating lists for %s ------" (database.GetName())
    let records = getFtpList database
    printfn "Generating new lists from loaded %s assembly list..." (database.GetName())
    printfn "Generated lists for %i species." (Seq.fold (generateLists database records) 0 characters)
    printfn "------ %s operations successful. ------" (database.GetName())

// ------ Main program ------

printfn "------------ Starting operations to generate GenBank and RefSeq data file lists for BioProviders. ------------"
generateGenBankLists()
generateRefSeqLists()
printfn "------------ All operations completed. ------------"
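
To make the role of GetBasePath() concrete, the following is a small sketch of the path trimming that getAssemblyList performs on each row's ftp_path. The base path is the RefSeq value from the diff; the sample ftp_path is only an illustrative value in the shape used by the assembly summary file, not a row taken from this commit's data.

```fsharp
// Base path for RefSeq, as returned by DatabaseName.GetBasePath() in the diff.
let refSeqBasePath = "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/"

// Illustrative ftp_path value (an assumption for this example).
let ftpPath =
    "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.40_GRCh38.p14"

// Drop the base path by slicing from its length onwards, mirroring
// row.ftp_path.[(String.length (database.GetBasePath()))..] in getAssemblyList.
let relativePath = ftpPath.[(String.length refSeqBasePath)..]

// Prints: 000/001/405/GCF_000001405.40_GRCh38.p14
printfn "%s" relativePath
```

Because the slice is by length only, it relies on every ftp_path in a database's summary file starting with that database's base path. Passing the GenBank base path for a RefSeq row would still produce the same result here only because the two prefixes happen to have equal length.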
Binary file modified build/data/genbank-assemblies-#.txt.gz
Binary file not shown.
Binary file modified build/data/genbank-assemblies-a.txt.gz
Binary file not shown.
Binary file modified build/data/genbank-assemblies-b.txt.gz
Binary file not shown.
Binary file modified build/data/genbank-assemblies-c.txt.gz
Binary file not shown.
Binary file modified build/data/genbank-assemblies-d.txt.gz
Binary file not shown.
Binary file modified build/data/genbank-assemblies-e.txt.gz
Binary file not shown.
Binary file modified build/data/genbank-assemblies-f.txt.gz
Binary file not shown.
Binary file modified build/data/genbank-assemblies-g.txt.gz
Binary file not shown.
Binary file modified build/data/genbank-assemblies-h.txt.gz
Binary file not shown.
Binary file modified build/data/genbank-assemblies-i.txt.gz
Binary file not shown.
Binary file modified build/data/genbank-assemblies-j.txt.gz
Binary file not shown.
Binary file modified build/data/genbank-assemblies-k.txt.gz
Binary file not shown.
Binary file modified build/data/genbank-assemblies-l.txt.gz
Binary file not shown.
Binary file modified build/data/genbank-assemblies-m.txt.gz
Binary file not shown.
Binary file modified build/data/genbank-assemblies-n.txt.gz
Binary file not shown.
Binary file modified build/data/genbank-assemblies-o.txt.gz
Binary file not shown.
Binary file modified build/data/genbank-assemblies-p.txt.gz
Binary file not shown.
Binary file modified build/data/genbank-assemblies-q.txt.gz
Binary file not shown.
Binary file modified build/data/genbank-assemblies-r.txt.gz
Binary file not shown.
Binary file modified build/data/genbank-assemblies-s.txt.gz
Binary file not shown.
Binary file modified build/data/genbank-assemblies-t.txt.gz
Binary file not shown.
Binary file modified build/data/genbank-assemblies-u.txt.gz
Binary file not shown.
Binary file modified build/data/genbank-assemblies-v.txt.gz
Binary file not shown.
Binary file modified build/data/genbank-assemblies-w.txt.gz
Binary file not shown.
Binary file modified build/data/genbank-assemblies-x.txt.gz
Binary file not shown.
Binary file modified build/data/genbank-assemblies-y.txt.gz
Binary file not shown.
Binary file modified build/data/genbank-assemblies-z.txt.gz
Binary file not shown.
Binary file modified build/data/genbank-species-#.txt.gz
Binary file not shown.
Binary file modified build/data/genbank-species-a.txt.gz
Binary file not shown.
Binary file modified build/data/genbank-species-b.txt.gz
Binary file not shown.
Binary file modified build/data/genbank-species-c.txt.gz
Binary file not shown.
Binary file modified build/data/genbank-species-d.txt.gz
Binary file not shown.
Binary file modified build/data/genbank-species-e.txt.gz
Binary file not shown.
Binary file modified build/data/genbank-species-f.txt.gz
Binary file not shown.
Binary file modified build/data/genbank-species-g.txt.gz
Binary file not shown.
Binary file modified build/data/genbank-species-h.txt.gz
Binary file not shown.
Binary file modified build/data/genbank-species-i.txt.gz
Binary file not shown.
Binary file modified build/data/genbank-species-j.txt.gz
Binary file not shown.
Binary file modified build/data/genbank-species-k.txt.gz
Binary file not shown.
Binary file modified build/data/genbank-species-l.txt.gz
Binary file not shown.
Binary file modified build/data/genbank-species-m.txt.gz
Binary file not shown.
Binary file modified build/data/genbank-species-n.txt.gz
Binary file not shown.
Binary file modified build/data/genbank-species-o.txt.gz
Binary file not shown.
Binary file modified build/data/genbank-species-p.txt.gz
Binary file not shown.
Binary file modified build/data/genbank-species-q.txt.gz
Binary file not shown.
Binary file modified build/data/genbank-species-r.txt.gz
Binary file not shown.
Binary file modified build/data/genbank-species-s.txt.gz
Binary file not shown.
Binary file modified build/data/genbank-species-t.txt.gz
Binary file not shown.
Binary file modified build/data/genbank-species-u.txt.gz
Binary file not shown.
Binary file modified build/data/genbank-species-v.txt.gz
Binary file not shown.
Binary file modified build/data/genbank-species-w.txt.gz
Binary file not shown.
Binary file modified build/data/genbank-species-x.txt.gz
Binary file not shown.
Binary file modified build/data/genbank-species-y.txt.gz
Binary file not shown.
Binary file modified build/data/genbank-species-z.txt.gz
Binary file not shown.
Binary file added build/data/refseq-assemblies-#.txt.gz
Binary file not shown.
Binary file added build/data/refseq-assemblies-a.txt.gz
Binary file not shown.
Binary file added build/data/refseq-assemblies-b.txt.gz
Binary file not shown.
Binary file added build/data/refseq-assemblies-c.txt.gz
Binary file not shown.
Binary file added build/data/refseq-assemblies-d.txt.gz
Binary file not shown.
Binary file added build/data/refseq-assemblies-e.txt.gz
Binary file not shown.
Binary file added build/data/refseq-assemblies-f.txt.gz
Binary file not shown.
Binary file added build/data/refseq-assemblies-g.txt.gz
Binary file not shown.
Binary file added build/data/refseq-assemblies-h.txt.gz
Binary file not shown.
Binary file added build/data/refseq-assemblies-i.txt.gz
Binary file not shown.
Binary file added build/data/refseq-assemblies-j.txt.gz
Binary file not shown.
Binary file added build/data/refseq-assemblies-k.txt.gz
Binary file not shown.
Binary file added build/data/refseq-assemblies-l.txt.gz
Binary file not shown.
Binary file added build/data/refseq-assemblies-m.txt.gz
Binary file not shown.
Binary file added build/data/refseq-assemblies-n.txt.gz
Binary file not shown.
Binary file added build/data/refseq-assemblies-o.txt.gz
Binary file not shown.
Binary file added build/data/refseq-assemblies-p.txt.gz
Binary file not shown.
Binary file added build/data/refseq-assemblies-q.txt.gz
Binary file not shown.
Binary file added build/data/refseq-assemblies-r.txt.gz
Binary file not shown.
Binary file added build/data/refseq-assemblies-s.txt.gz
Binary file not shown.
Binary file added build/data/refseq-assemblies-t.txt.gz
Binary file not shown.
Binary file added build/data/refseq-assemblies-u.txt.gz
Binary file not shown.
Binary file added build/data/refseq-assemblies-v.txt.gz
Binary file not shown.
Binary file added build/data/refseq-assemblies-w.txt.gz
Binary file not shown.
Binary file added build/data/refseq-assemblies-x.txt.gz
Binary file not shown.
Binary file added build/data/refseq-assemblies-y.txt.gz
Binary file not shown.
Binary file added build/data/refseq-assemblies-z.txt.gz
Binary file not shown.
Binary file added build/data/refseq-species-#.txt.gz
Binary file not shown.
Binary file added build/data/refseq-species-a.txt.gz
Binary file not shown.
Binary file added build/data/refseq-species-b.txt.gz
Binary file not shown.
Binary file added build/data/refseq-species-c.txt.gz
Binary file not shown.
Binary file added build/data/refseq-species-d.txt.gz
Binary file not shown.
Binary file added build/data/refseq-species-e.txt.gz
Binary file not shown.
Binary file added build/data/refseq-species-f.txt.gz
Binary file not shown.
Binary file added build/data/refseq-species-g.txt.gz
Binary file not shown.
Binary file added build/data/refseq-species-h.txt.gz
Binary file not shown.
Binary file added build/data/refseq-species-i.txt.gz
Binary file not shown.
Binary file added build/data/refseq-species-j.txt.gz
Binary file not shown.
Binary file added build/data/refseq-species-k.txt.gz
Binary file not shown.
Binary file added build/data/refseq-species-l.txt.gz
Binary file not shown.
Binary file added build/data/refseq-species-m.txt.gz
Binary file not shown.
Binary file added build/data/refseq-species-n.txt.gz
Binary file not shown.
Binary file added build/data/refseq-species-o.txt.gz
Binary file not shown.
Binary file added build/data/refseq-species-p.txt.gz
Binary file not shown.
Binary file added build/data/refseq-species-q.txt.gz
Binary file not shown.
Binary file added build/data/refseq-species-r.txt.gz
Binary file not shown.
Binary file added build/data/refseq-species-s.txt.gz
Binary file not shown.
Binary file added build/data/refseq-species-t.txt.gz
Binary file not shown.
Binary file added build/data/refseq-species-u.txt.gz
Binary file not shown.
Binary file added build/data/refseq-species-v.txt.gz
Binary file not shown.
Binary file added build/data/refseq-species-w.txt.gz
Binary file not shown.
Binary file added build/data/refseq-species-x.txt.gz
Binary file not shown.
Binary file added build/data/refseq-species-y.txt.gz
Binary file not shown.
Binary file added build/data/refseq-species-z.txt.gz
Binary file not shown.