read_* functions: make limit parameter accept regex pattern or slice
schlegelp committed Sep 18, 2024
1 parent 14218c1 commit b187de2
Showing 7 changed files with 114 additions and 48 deletions.
13 changes: 8 additions & 5 deletions docs/changelog.md
@@ -36,17 +36,17 @@ more consistent and easier to use.
- New function: [`navis.graph.skeleton_adjacency_matrix`][] computes the node adjacency for skeletons
- New function: [`navis.graph.simplify_graph`][] simplifies skeleton graphs to only root, branch and leaf nodes while preserving branch length (i.e. weights)
- New [`NeuronList`][navis.NeuronList] method: [`get_neuron_attributes`][navis.NeuronList.get_neuron_attributes] is analogous to `dict.get`
- [`NeuronLists`][navis.NeuronList] now implemented the `|` (`__or__`) operator which can be used to get the union of two [`NeuronLists`][navis.NeuronList]
- [`NeuronLists`][navis.NeuronList] now implement the `|` (`__or__`) operator which can be used to get the union of two [`NeuronLists`][navis.NeuronList]
- [`navis.Volume`][] now has an (optional) `.units` property similar to neurons

##### Improvements
- Plotting:
- [`navis.plot3d`][]:
- `legendgroup` parameter (plotly backend) now also sets the legend group's title
- new parameters for the plotly backend:
- `legend` (default `True`): determines whether the legend is shown
- `legend_orientation` (default `v`): determines whether the legend is arranged vertically (`v`) or horizontally (`h`)
- `linestyle` (default `-`): determines line style for skeletons
- default for `radius` is now `"auto"`
- [`navis.plot2d`][]:
- the `view` parameter now also works with `methods` `3d` and `3d_complex`
@@ -55,13 +55,16 @@ more consistent and easier to use.
- new parameters for methods `3d` and `3d_complex`: `mesh_shade=False` and `non_view_axes3d`
- the `scalebar` parameter can now be a dictionary used to style (color, width, etc) the scalebar
- the `connectors` parameter can now be used to show specific connector types (e.g. `connectors="pre"`)
- I/O:
- `read_*` functions are now able to read from FTP servers (`ftp://...`)
- the `limit` parameter used in many `read_*` functions can now also be a regex pattern or a `slice` (see the sketch after this list)
- General improvements to docs and tutorials
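
For example, the new `limit` forms can be used with any of the `read_*` functions (a quick sketch; `skeletons.zip` stands in for any local folder or archive):

>>> import navis
>>> nl = navis.read_swc('skeletons.zip', limit=10)              # first 10 files
>>> nl = navis.read_swc('skeletons.zip', limit=slice(10, 20))   # files 10 through 19
>>> nl = navis.read_swc('skeletons.zip', limit=r'.*_R.*')       # only filenames matching the pattern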

##### Fixes
- Memory usage of `Neuron/Lists` is now correctly re-calculated when the neuron is modified
- Various fixes and improvements for the MICrONS interface (`navis.interfaces.microns`)
- [`navis.graph.node_label_sorting`][] now correctly prioritizes total branch length
- [`navis.TreeNeuron.simple][] now correctly drops soma nodes if they aren't root, branch or leaf points themselves
- [`navis.TreeNeuron.simple`][] now correctly drops soma nodes if they aren't root, branch or leaf points themselves

## Version `1.7.0` { data-toc-label="1.7.0" }
_Date: 25/07/24_
55 changes: 41 additions & 14 deletions navis/io/base.py
@@ -48,6 +48,9 @@

DEFAULT_INCLUDE_SUBDIRS = False

# Regular expression to figure out if a string is a regex pattern
rgx = re.compile(r'[\\\.\?\[\]\+\^\$\*]')
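
For illustration, this heuristic treats strings without regex metacharacters as plain substrings (a sketch of the intended behaviour):

>>> bool(rgx.search('722817260'))   # no metacharacters -> plain substring match downstream
False
>>> bool(rgx.search(r'.*_R.*'))     # contains '.' and '*' -> treated as a regex pattern
True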


def merge_dicts(*dicts: Optional[Dict], **kwargs) -> Dict:
"""Merge dicts and kwargs left to right.
@@ -541,7 +544,7 @@ def read_ftp(
) -> "core.NeuronList":
"""Read files from an FTP server.
This is a dispatcher for `.read_from_tar`.
This is a dispatcher for `.read_from_ftp`.
Parameters
----------
@@ -613,6 +616,8 @@ def read_from_ftp(
core.NeuronList
"""
# When reading in parallel, we expect there to be a global FTP connection
# that was initialized once for each worker process.
if ftp == "GLOBAL":
if "_FTP" not in globals():
raise ValueError("No global FTP connection found.")
@@ -668,8 +673,18 @@ def read_directory(
"""
files = list(self.files_in_dir(Path(path), include_subdirs))

if limit:
if isinstance(limit, int):
files = files[:limit]
elif isinstance(limit, list):
files = [f for f in files if f in limit]
elif isinstance(limit, slice):
files = files[limit]
elif isinstance(limit, str):
# Check if limit is a regex
if rgx.search(limit):
files = [f for f in files if re.search(limit, str(f.name))]
else:
files = [f for f in files if limit in str(f)]

read_fn = partial(self.read_file_path, attrs=attrs)
neurons = parallel_read(read_fn, files, parallel)
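
Pulled out of the reader, the dispatch above boils down to something like the following self-contained sketch (the helper name `apply_limit` is illustrative, not part of navis):

from pathlib import Path
import re

rgx = re.compile(r'[\\\.\?\[\]\+\^\$\*]')  # same heuristic as above

def apply_limit(files, limit):
    """Restrict a list of file paths according to `limit` (int, list, slice or str)."""
    if isinstance(limit, int):
        return files[:limit]
    elif isinstance(limit, list):
        return [f for f in files if f in limit]
    elif isinstance(limit, slice):
        return files[limit]
    elif isinstance(limit, str):
        if rgx.search(limit):  # looks like a regex -> match against the file name
            return [f for f in files if re.search(limit, str(f.name))]
        return [f for f in files if limit in str(f)]  # plain substring match
    return files

files = [Path(f"neuron_{i:03}_L.swc") for i in range(5)] + [Path("neuron_005_R.swc")]
apply_limit(files, r".*_R.*")    # -> [Path('neuron_005_R.swc')]
apply_limit(files, slice(1, 3))  # -> second and third file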
Expand Down Expand Up @@ -1123,6 +1138,14 @@ def parallel_read_archive(

if isinstance(limit, list):
to_read = [f for f in to_read if f in limit]
elif isinstance(limit, slice):
to_read = to_read[limit]
elif isinstance(limit, str):
# Check if limit is a regex
if rgx.search(limit):
to_read = [f for f in to_read if re.search(limit, f)]
else:
to_read = [f for f in to_read if limit in f]

prog = partial(
config.tqdm,
@@ -1159,7 +1182,6 @@ def parallel_read_ftp(
file_ext,
limit=None,
parallel="auto",
ignore_hidden=True,
) -> List["core.NeuronList"]:
"""Read neurons from an FTP server, potentially in parallel.
@@ -1185,13 +1207,6 @@
parallel : str | bool | int
"auto" or True for n_cores // 2, otherwise int for number of
jobs, or false for serial.
ignore_hidden : bool
Archives zipped on OSX can end up containing a
`__MACOSX` folder with files that mirror the name of other
files. For example if there is a `123456.swc` in the archive
you might also find a `__MACOSX/._123456.swc`. Reading the
latter will result in an error. If ignore_hidden=True
we will simply ignore all file that starts with "._".
Returns
-------
@@ -1245,11 +1260,21 @@
elif file_ext and fname.endswith(file_ext):
to_read.append(file)

if isinstance(limit, int) and len(to_read) >= limit:
break

if isinstance(limit, list):
if isinstance(limit, int):
to_read = to_read[:limit]
elif isinstance(limit, list):
to_read = [f for f in to_read if f in limit]
elif isinstance(limit, slice):
to_read = to_read[limit]
elif isinstance(limit, str):
# Check if limit is a regex
if rgx.search(limit):
to_read = [f for f in to_read if re.search(limit, f)]
else:
to_read = [f for f in to_read if limit in f]

if not to_read:
return []

prog = partial(
config.tqdm,
@@ -1269,6 +1294,8 @@
else:
n_cores = int(parallel)

# We can't send the FTP object to the process (because its socket is not pickleable)
# Instead, we need to initialize a new FTP connection in each process via a global variable
with mp.Pool(
processes=n_cores, initializer=_ftp_pool_init, initargs=(server, port, path)
) as pool:
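
A minimal sketch of the initializer pattern referenced above, using the standard library `ftplib` (the connection details are placeholders and the body of `_ftp_pool_init` is illustrative, not navis' exact implementation):

from ftplib import FTP
import multiprocessing as mp

def _ftp_pool_init(server, port, path):
    # Runs once in every worker process: open a fresh connection and store it
    # as a process-global, because an open socket cannot be pickled and shipped
    # to the worker.
    global _FTP
    _FTP = FTP()
    _FTP.connect(server, port)
    _FTP.login()        # anonymous login
    _FTP.cwd(path)

# with mp.Pool(processes=4, initializer=_ftp_pool_init,
#              initargs=("ftp.example.org", 21, "/skeletons")) as pool:
#     ...  # each worker then reads via its own global _FTP connection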
17 changes: 12 additions & 5 deletions navis/io/mesh_io.py
@@ -62,11 +62,18 @@ def read_mesh(f: Union[str, Iterable],
Determines function's output. See Returns.
errors : "raise" | "log" | "ignore"
If "log" or "ignore", errors will not be raised.
limit : int, optional
If reading from a folder you can use this parameter to
read only the first `limit` files. Useful when
wanting to get a sample from a large library of
meshes.
limit : int | str | slice | list, optional
When reading from a folder or archive you can use this parameter to
restrict which files are read:
- if an integer, will read only the first `limit` files
(useful to get a sample from a large library of meshes)
- if a string, will interpret it as a filename (regex) pattern
and only read files that match the pattern; e.g. `limit='.*_R.*'`
will only read files that contain `_R` in their filename
- if a slice (e.g. `slice(10, 20)`) will read only the files in
that range
- a list is expected to be a list of filenames to read from
the folder/archive
**kwargs
Keyword arguments passed to [`navis.MeshNeuron`][]
or [`navis.Volume`][]. You can use this to e.g.
17 changes: 12 additions & 5 deletions navis/io/nmx_io.py
@@ -204,11 +204,18 @@ def read_nmx(f: Union[str, pd.DataFrame, Iterable],
Precision for data. Defaults to 32 bit integers/floats.
If `None` will let pandas infer data types - this
typically leads to higher than necessary precision.
limit : int, optional
If reading from a folder you can use this parameter to
read only the first `limit` NMX files. Useful if
wanting to get a sample from a large library of
skeletons.
limit : int | str | slice | list, optional
When reading from a folder or archive you can use this parameter to
restrict which files are read:
- if an integer, will read only the first `limit` NMX files
(useful to get a sample from a large library of skeletons)
- if a string, will interpret it as a filename (regex) pattern
and only read files that match the pattern; e.g. `limit='.*_R.*'`
will only read files that contain `_R` in their filename
- if a slice (e.g. `slice(10, 20)`) will read only the files in
that range
- a list is expected to be a list of filenames to read from
the folder/archive
**kwargs
Keyword arguments passed to the construction of
`navis.TreeNeuron`. You can use this to e.g. set
17 changes: 12 additions & 5 deletions navis/io/precomputed_io.py
@@ -252,11 +252,18 @@ def read_precomputed(f: Union[str, io.BytesIO],
- `False` = do not use/look for `info` file
- `str` = filepath to `info` file
- `dict` = already parsed info file
limit : int, optional
If reading from a folder you can use this parameter to
read only the first `limit` files. Useful if
wanting to get a sample from a large library of
skeletons/meshes.
limit : int | str | slice | list, optional
When reading from a folder or archive you can use this parameter to
restrict which files are read:
- if an integer, will read only the first `limit` files
(useful to get a sample from a large library of neurons)
- if a string, will interpret it as a filename (regex) pattern
and only read files that match the pattern; e.g. `limit='.*_R.*'`
will only read files that contain `_R` in their filename
- if a slice (e.g. `slice(10, 20)`) will read only the files in
that range
- a list is expected to be a list of filenames to read from
the folder/archive
parallel : "auto" | bool | int
Defaults to `auto` which means only use parallel
processing if more than 200 files are imported. Spawning
41 changes: 27 additions & 14 deletions navis/io/swc_io.py
@@ -273,14 +273,20 @@ def read_swc(f: Union[str, pd.DataFrame, Iterable],
Parameters
----------
f : str | pandas.DataFrame | iterable
Filename, folder, SWC string, URL or DataFrame.
If folder, will import all `.swc` files. If a
`.zip`, `.tar` or `.tar.gz` file will read all
SWC files in the file. See also `limit` parameter.
f : str | pandas.DataFrame | list thereof
Filename, folder, SWC string, URL or DataFrame:
- if folder, will import all `.swc` files
- if a `.zip`, `.tar` or `.tar.gz` archive, will read all
SWC files from the archive
- if a URL (http:// or https://), will download the
file and import it
- an FTP address (ftp://) can point to a folder or a single
file
- DataFrames are interpreted as SWC tables
See also the `limit` parameter to read only a subset of files.
connector_labels : dict, optional
If provided will extract connectors from SWC.
Dictionary must map type to label:
Dictionary must map types to labels:
`{'presynapse': 7, 'postsynapse': 8}`
include_subdirs : bool, optional
If True and `f` is a folder, will also search
@@ -293,7 +299,7 @@ def read_swc(f: Union[str, pd.DataFrame, Iterable],
and joining processes causes overhead and is
considerably slower for imports of small numbers of
neurons. Integer will be interpreted as the
number of cores (otherwise defaults to
number of processes to use (defaults to
`os.cpu_count() // 2`).
precision : int [8, 16, 32, 64] | None
Precision for data. Defaults to 32 bit integers/floats.
@@ -325,16 +331,23 @@
read_meta : bool
If True and SWC header contains a line with JSON-encoded
meta data e.g. (`# Meta: {'id': 123}`), these data
will be read as neuron properties. `fmt` takes
will be read as neuron properties. `fmt` still takes
precedence. Will try to assign meta data directly as
neuron attribute (e.g. `neuron.id`). Failing that
(can happen for properties intrinsic to `TreeNeurons`),
will add a `.meta` dictionary to the neuron.
limit : int, optional
If reading from a folder you can use this parameter to
read only the first `limit` SWC files. Useful if
wanting to get a sample from a large library of
skeletons.
limit : int | str | slice | list, optional
When reading from a folder or archive you can use this parameter to
restrict which files are read:
- if an integer, will read only the first `limit` SWC files
(useful to get a sample from a large library of skeletons)
- if a string, will interpret it as a filename (regex) pattern
and only read files that match the pattern; e.g. `limit='.*_R.*'`
will only read files that contain `_R` in their filename
- if a slice (e.g. `slice(10, 20)`) will read only the files in
that range
- a list is expected to be a list of filenames to read from
the folder/archive
**kwargs
Keyword arguments passed to the construction of
`navis.TreeNeuron`. You can use this to e.g. set
@@ -368,7 +381,7 @@ def read_swc(f: Union[str, pd.DataFrame, Iterable],
>>> s = navis.read_swc('skeletons.zip') # doctest: +SKIP
Sample first 100 SWC files in a zip archive:
Sample the first 100 SWC files in a zip archive:
>>> s = navis.read_swc('skeletons.zip', limit=100) # doctest: +SKIP
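
Read neurons straight from an FTP server (a sketch; the server address below is a placeholder):

>>> s = navis.read_swc('ftp://ftp.example.org/skeletons/', limit=10)  # doctest: +SKIP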
2 changes: 2 additions & 0 deletions navis/utils/misc.py
@@ -96,6 +96,8 @@ def is_url(x: str) -> bool:
False
>>> is_url('http://www.google.com')
True
>>> is_url("ftp://download.ft-server.org:8000")
True
"""
parsed = urllib.parse.urlparse(x)
