Tutorial¶
almirah is a tool to organize, query, and summarize data. In this tutorial we will use almirah on a dummy test dataset that represents imaging data obtained from mice to illustrate some of its use cases. If you have not installed almirah yet, please go to the Installing section and follow the instructions.
The dummy dataset that we will be using comes along with almirah and can be created at a convenient location using create_tutorial_dataset().
from almirah.utils import create_tutorial_dataset

# Generate the dummy dataset at the chosen location
create_tutorial_dataset("path/to/store")
The dataset contents can be listed using the tree command.
$ tree "path/to/store"
.
├── 72
│   ├── DAY01_G72_20180201.npy
│   └── DAY02_G72_20180202.npy
├── DAY01_G171_20180101.npy
├── Day2_171_20180102.npy
└── G433
    ├── DAY_G433_20180301.npy
    └── day02_G433_20180301.npy

3 directories, 6 files
The dataset is unformatted and difficult to navigate, as can be seen from the content hierarchy. Let us attempt to change this.
Defining a Specification¶
Everything almirah can do starts with defining the Specification details. To come up with a Specification for this dataset, let us think about how we would structure it manually.
Deciding path patterns¶
We would prefer to navigate in the following order from the dataset root directory: mice -> day -> file. This is intuitive and leaves scope for expansion as new data is generated. Based on this, the path relative to the dataset root would look like mice/day/file. The file name should contain the details of mice, day, and the type of imaging, so that it is uniquely identified from others by filename alone. For example, a nice file name would be mice-G433_day-01_imaging-calcium.npy.
The path decided above can be mentioned in the Specification details under the path_patterns key like so:
path_patterns:
- "mice-{mice}/day-{day}/mice-{mice}_day-{day}_imaging-{imaging}{extension}"
Here, contents enclosed in {} represent tag values. A File in almirah is associated with a bunch of tags. It is possible to provide more details on tags regarding valid values, the default value, and whether the Tag is mandatory.
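As an illustration only, such tag details might sit alongside the path patterns in the config. The keys below (values, default, mandatory) are assumed for the sake of the sketch and are not confirmed almirah schema; consult the Specification reference for the supported fields:
path_patterns:
  - "mice-{mice}/day-{day}/mice-{mice}_day-{day}_imaging-{imaging}{extension}"
tags:
  - name: imaging
    values: [calcium]   # hypothetical list of valid values
    default: calcium    # hypothetical default value
    mandatory: true     # hypothetical flag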
Building paths¶
Now that we have the basic Specification details, we can build valid paths by providing the required tags. The Specification details can either be stored as a YAML file and the Specification created using create_from_file(), or the contents can be passed directly as a dictionary via the details argument of Specification.
Tags are key:value pairs that convey information regarding the file. For our decided path, the tags we require for path building are mice, day, imaging, and extension.
from almirah import Specification

# Create a Specification object
spec = Specification.create_from_file("/path/to/config")

# Check out paths built based on the tags provided
spec.build_path(mice='G433', day='01', imaging='calcium', extension='.npy')
# mice-G433/day-01/mice-G433_day-01_imaging-calcium.npy

spec.build_path(mice='G41', imaging='calcium', extension='.npy')
# Returns None as no valid path can be built using only these tags
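If you prefer not to keep a separate YAML file, the same details can be passed in as a dictionary. A minimal sketch, assuming the details argument of the Specification constructor accepts the same structure as the YAML file:
from almirah import Specification

# Pass the details dictionary directly instead of loading a YAML file
details = {
    "path_patterns": [
        "mice-{mice}/day-{day}/mice-{mice}_day-{day}_imaging-{imaging}{extension}"
    ]
}
spec = Specification(details=details)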
Organizing the dataset¶
With the Specification defined, it is now possible to restructure a dataset, provided we can retrieve the tags required to build a valid path from each target file. If we can build a valid relative path, it then boils down to moving the file to that position with respect to the root directory. Let us do this for our dataset.
Required parameters to begin organizing a dataset are:
source: The root directory of unorganized data.
destination: The directory where organized data will be stored.
pattern: All filenames matching this pattern are considered for organizing. For our dataset, this can be *.npy.
Establishing tag rules¶
Let us consider all the file names in the dataset, namely:
DAY01_G72_20180201.npy
DAY02_G72_20180202.npy
DAY01_G171_20180101.npy
Day2_171_20180102.npy
DAY_G433_20180301.npy
day02_G433_20180301.npy
We notice that the day tag value is at the beginning, between the string day (ignoring case) and an _ (underscore). This can be extracted with the regex pattern (?i)day([0-9]+)_. But one of the files does not have any info on the day value, and for this no value will be captured. Since we have exclusive information from the person who performed the experiments, we know it is 1, and by default if no value is captured we would like to choose this. Further, to maintain uniformity we would like to ensure that the length of the day tag is 2, and if not, prefix it with 0. Putting these together as a tag retrieval rule:
tags_rules:
  - name: day
    pattern: "(?i)day([0-9]+)_"
    length: 2
    iffy_prefix: 0
    default: 1
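As a quick sanity check, the capture behaviour of this pattern can be verified with Python's re module. The padding and default logic below mimics the rule by hand; it is not the almirah implementation:
import re

for name in ["DAY01_G72_20180201.npy", "DAY_G433_20180301.npy"]:
    match = re.search(r"(?i)day([0-9]+)_", name)
    # Fall back to the default of 1 when nothing is captured
    day = match.group(1) if match else "1"
    # Left-pad with 0 so the tag is always 2 characters long
    print(name, "->", day.zfill(2))
# DAY01_G72_20180201.npy -> 01
# DAY_G433_20180301.npy -> 01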
Similarly, we see that the value following the first underscore gives the mice info. This can be extracted with the regex pattern _([a-zA-Z0-9]+)_. Since one of the files does not have the mice info beginning with the prefix G, we choose to extract only the numbers using the regex pattern _G?([0-9]+)_ and prepend the result with G. Putting these together as a rule:
tags_rules:
  - name: mice
    pattern: "_G?([0-9]+)_"
    prepend: G
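The same kind of check works for the mice pattern; note how the optional G? lets both G-prefixed and bare numeric names yield the digits, which we then prepend with G by hand:
import re

for name in ["DAY01_G171_20180101.npy", "Day2_171_20180102.npy"]:
    match = re.search(r"_G?([0-9]+)_", name)
    # Prepend G so all mice tags share a uniform form
    print(name, "->", "G" + match.group(1))
# DAY01_G171_20180101.npy -> G171
# Day2_171_20180102.npy -> G171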
There is no value for the imaging tag in the path, but we know that it is calcium imaging data, so let us set all values to calcium. Similarly, let extension be .npy.
tags_rules:
  - name: imaging
    value: calcium
  - name: extension
    value: .npy
Putting these together, we get our organization rules config:
source: "/path/to/unorganized/data"
destination: "/path/to/store/organized/data"
pattern: "*.npy"
tags_rules:
  - name: day
    pattern: "(?i)day([0-9]+)_"
    length: 2
    iffy_prefix: 0
    default: 1
  - name: mice
    pattern: "_G?([0-9]+)_"
    prepend: G
  - name: imaging
    value: calcium
  - name: extension
    value: .npy
Executing the rules¶
To run this set of rules and obtain the organized dataset, we can use organize().
from almirah.utils import read_yaml
# Load rules into dictionary
rules = read_yaml("/path/to/rules")
# Execute rules and use Specification for path building
spec.organize(rules)
Listing the contents of the organized dataset using tree after the run gives:
.
├── mice-G72
│   ├── day-01
│   │   └── mice-G72_day-01_imaging-calcium.npy
│   └── day-02
│       └── mice-G72_day-02_imaging-calcium.npy
├── mice-G171
│   ├── day-01
│   │   └── mice-G171_day-01_imaging-calcium.npy
│   └── day-02
│       └── mice-G171_day-02_imaging-calcium.npy
└── mice-G433
    ├── day-01
    │   └── mice-G433_day-01_imaging-calcium.npy
    └── day-02
        └── mice-G433_day-02_imaging-calcium.npy
This is much easier to navigate and better structured. Along with this comes the bonus of being able to query the dataset once indexed. In almirah, a collection of files with a common root directory forms a Layout, and a collection of layouts forms a Dataset.
Querying¶
Tags associated with files can help in filtering through the dataset. We can retrieve a subset of files that satisfy certain tags. An operation called indexing makes this possible.
The indexing operation¶
In brief, the indexing operation on a Layout or a Dataset crawls top-down from the root directory and figures out the tag:file associations. These associations are stored and used while querying to filter through the dataset. Additional associations for a file, beyond the tags present in the path, can be provided with an accompanying JSON file that shares the same filename. Indexing is performed by index().
To index our tutorial dataset:
from almirah import Layout
# Create a Layout instance with data path and Specification
layout = Layout(root="/path/to/organized/data", specification_name=spec.name)
print(layout)
# <Layout root: '/path/to/organized/data'>
The returned instance does not yet have the details of the files within it or the tags for each file. For this, it has to be indexed.
# Index the Layout with tags mentioned in Specification
layout.index()
print(f"Files in layout: {len(layout.files)}")
# Files in layout: 6
print(f"The first file is {layout.files[0]}.")
# The first file is <File path=`/path/to/file`>.
Important
Before Layout creation, it is required that a Specification be created. The specification_name argument is used to retrieve the matching Specification from the index. This is what we achieved earlier with Specification.create_from_file().
With this, we can start querying in multiple ways.
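For example, a tag-based query might look like the following. The keyword-filter form is an assumption about query()'s signature; check the API Reference for the exact parameters:
# Hypothetical tag-based query: fetch files for mice G433 on day 01
files = layout.query(mice="G433", day="01")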
More components!!¶
Great! Now we have organized a dataset and made it query-able. Though we have used the word dataset liberally so far, in almirah, a Dataset and a Layout are conceptually different.
A Layout is a collection of files in a directory that follow a Specification. A collection of Layouts makes up a Dataset. These Layouts are said to be the components of the Dataset.
Now, with this in mind, we can see that we organized the tutorial dataset into a Layout. If we have more Layouts related to the project, we can collect them under a Dataset.
from almirah import Dataset
# Create a Dataset
dataset = Dataset(name="tutorial")
# Add our previous Layout to the Dataset
dataset.add(layout)
# If you have more, you can add them too
Operations performed on a Layout, like indexing and querying, can be performed on a Dataset too using index() and query(). These will perform the operation on all the components of the dataset.
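For instance, a single call now covers every component. A small sketch, where the keyword tag filter is again an assumed form of query():
# Index every component of the Dataset in one call
dataset.index()

# Query across all components; keyword tag filters are assumed here
files = dataset.query(mice="G433")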
Databases as components¶
Often, datasets have relevant information stored in databases. If we miss out on these, the dataset is not complete. Worry not, a Database can be added as a component to a Dataset. Further, a Database can be queried to retrieve records using query() too. Queries with the parameter table to a Dataset are directed to its component databases only. Connection to the Database has to be established using connect() during each Python session. This is because the index does not store usernames and passwords, for security reasons.
from almirah import Database

# Create a Database instance for a MySQL db 'tutorial' running locally
db = Database(name="tutorial", host="127.0.0.1", backend="mysql")

# Provide your credentials to establish the connection
db.connect(username, password)

# Add db to the Dataset
dataset.add(db)
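Once added, records can be pulled through the Dataset with the table parameter; the table name below is made up for illustration:
# Direct the query to component databases via the table parameter
records = dataset.query(table="sessions")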
A Database backed by SQLAlchemy is in connection mode. An alternative to this is the request mode, which is applicable only for retrieving records and not other operations. To set this mode on object creation, provide the parameter backend="request":
# Create a Database in 'request' mode
db = Database(name="tutorial", host="127.0.0.1", backend="request")
# Connect to end point
db.connect(username, password)
In this mode, queries are sent as POST requests with an appropriate authentication header to the URL endpoint, and the response is taken as the result. For this to work, you should have set up a URL endpoint on the server that the Database instance points to, and set up appropriate authentication at host/authenticate to return a token. This URL endpoint processes the request appropriately and returns the records as a JSON response. One way this can be achieved is by extracting the POST request keys and passing them as arguments to query() in the view that handles requests to the URL endpoint.
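A minimal server-side sketch of that idea, assuming Flask, a made-up /query route, placeholder credentials, and that query() returns JSON-serializable records; the token check is omitted for brevity:
from flask import Flask, jsonify, request

from almirah import Database

app = Flask(__name__)

# Connect once on the server; name, host, and credentials are placeholders
db = Database(name="tutorial", host="127.0.0.1", backend="mysql")
db.connect("user", "secret")

@app.route("/query", methods=["POST"])
def handle_query():
    # Extract query arguments from the POST body and forward them to query()
    kwargs = request.get_json()
    records = db.query(**kwargs)
    return jsonify(records)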
Another mode of operation deals with data in Google Sheets. Here, each spreadsheet is treated as the Database and each worksheet in the spreadsheet is analogous to a table. To read data from sheets, set backend="gsheet" and provide the keyfile to connect.
Note
Please have a look at the Create service account keys section of the Google Workspace docs to create a keyfile.
# Create a Database in 'gsheet' mode
db = Database(name='any name', host='gsheet url', backend='gsheet')

# Connect to the gsheet via the Google API
db.connect(keyfile='/path/to/keyfile')
Important
Only databases supported by SQLAlchemy are functional via almirah for operations other than reading.
Other capabilities¶
This tutorial is a quick run-through of some of the things almirah can help with. There are other things it can do:
Migrate database records from one schema to another
Interface with DataLad compatible datasets
Please look up the API Reference if you need these. If there is a feature you would like but almirah does not support yet, please do consider raising a request or Contributing.