File Renamer Re-design

Design Principles

Caveat: I am mostly rewriting the script as an opportunity to learn Python and have some fun, so don't expect these to be particularly sensible! I do enough serious coding at work…

Pretty code
Modular: the Perl version of the script suffered really badly from being unstructured
A different function for each use-case, which makes the code semantic and easier to extend.

Modularity

Although it is clear to me that the system should be modular, I am as yet undecided on the subject of Python modules. One of the reasons I hesitate is that every Python module is a separate file, and I want to keep the script as a single file.

I suspect that one of the most attractive features of tvrenamer.pl was the minimal commitment and fuss required to try it out - just download a single file and run it. I personally love software that is as simple as that.

So for now, the plan is to define a class for each “module” (see below), and instantiate it just once if that is all it is needed for.

Modules

We need at least the following distinct entities which can be extended without requiring other modules to be updated. (Get the interfaces correct!)

Module	Status	Notes
Main		Have some example code on modules vs standalone scripts
Store	Done
Procure	EpGuides + AniDB + theTVDB done	Done. (and has a functional plugin system!)
Survey
Name constructors
Suggestor
User-interface		wxWidgets and later OpenGL ES 2.0 (probably on my Pandora, when it arrives)
Preferences	Done	Standalone test atm

Main

Encapsulates an instance of the script, and controls overall program flow. This should be a distinct entity so that the script can “restart” execution without exiting, which will allow it to walk directory trees for recursive renaming.
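A minimal sketch of the "restart without exiting" idea, assuming a hypothetical Main class with a run() method. The real program flow (survey, procure, suggest, rename) is reduced to a placeholder that just lists the files it would consider.

```python
import os


class Main:
    """One instance of the script, bound to a single directory (hypothetical)."""

    def __init__(self, directory):
        self.directory = directory

    def run(self):
        # Placeholder for the real flow: survey, procure, suggest, rename.
        # Here we only report the files that would be considered.
        return sorted(
            name for name in os.listdir(self.directory)
            if os.path.isfile(os.path.join(self.directory, name))
        )


def rename_recursively(root):
    """Create a fresh Main for every directory under root."""
    results = {}
    for directory, _subdirs, _files in os.walk(root):
        results[directory] = Main(directory).run()
    return results
```

Because each directory gets its own Main, per-directory state never leaks from one part of the tree into the next.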

Considerations

Should there be an instance of Preferences per instance of Main (i.e. reload preferences for each directory being renamed) or should Preferences be global? Or perhaps both?

Store

A database-like entity which retains all episode and season data extracted from various sources by other modules.

The store will tag all data with provenance info, such that when multiple sources feed the store conflicting details about the same episode, the store will retain both versions. This will allow another module to consider each source separately (to pick the “best” one) or to mix-and-match sources as needed.

This will likely be a volatile entity which exists only at run-time, although it could be stored à la tvrenamer.pl's .cache files…
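As a rough sketch of the provenance idea, here is a run-time-only store that keeps one version of each episode per source. The class layout, method names and source tags are all assumptions made for illustration, not a settled interface.

```python
from collections import defaultdict


class Store:
    """In-memory episode store that keeps every source's version of a record."""

    def __init__(self):
        # (series, season, episode) -> {source_tag: details}
        self._records = defaultdict(dict)

    def add(self, source_tag, series, season, episode, **details):
        """Record one source's details for one episode."""
        self._records[(series, season, episode)][source_tag] = details

    def versions(self, series, season, episode):
        """Return every source's details for one episode, keyed by source tag."""
        return dict(self._records.get((series, season, episode), {}))

    def from_source(self, source_tag):
        """Return all episodes supplied by a single source."""
        return {
            key: by_source[source_tag]
            for key, by_source in self._records.items()
            if source_tag in by_source
        }


store = Store()
# Two sources disagree about the same episode; both versions are retained.
store.add("epguides", "Test Series", 1, 8, title="Test Episode Title")
store.add("thetvdb", "Test Series", 1, 8, title="Test Episode Title (Part 1)")
```

Keying the inner dictionary by source tag is what lets a later module either pick one source wholesale (from_source) or compare readings episode by episode (versions).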

Procure

This module will be responsible for gathering episode data and feeding it to the Store.

This module will define a “Source” class which will define the interface between source-specific code and the Procure module. In other words, each website which the script supports will sub-class “Source”.
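One possible shape for the “Source” interface, sketched with an abstract base class. The fetch() method, the tag attribute and the 'season,episode,title' input format are all hypothetical; the toy sub-class only shows what source-specific code would have to supply.

```python
from abc import ABC, abstractmethod


class Source(ABC):
    """Interface between source-specific code and the Procure module."""

    # Unique tag used for provenance in the Store (assumed attribute).
    tag = "unknown"

    @abstractmethod
    def fetch(self, series_name):
        """Return a list of episode dicts for the given series."""


class UserInputSource(Source):
    """Toy source that parses 'season,episode,title' lines typed by the user."""

    tag = "user-input"

    def __init__(self, text):
        self.text = text

    def fetch(self, series_name):
        episodes = []
        for line in self.text.splitlines():
            if not line.strip():
                continue  # ignore blank lines
            season, episode, title = line.split(",", 2)
            episodes.append({
                "series_name": series_name,
                "season_number": int(season),
                "episode_number": int(episode),
                "episode_title": title.strip(),
            })
        return episodes
```

Procure would then only ever call fetch() and read tag, so adding a new website should not require touching any other module.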

There are three broad categories of sources:

Websites
Files
User-input

TODO: Decide if each of the above should have its own sub-class, to allow code reuse in an elegant manner.

Websites

Sources such as

With each site requiring different code to

Find the right page on the site (i.e. perform a search and pick the appropriate result(s))
Extract the episode details from the HTML

When a search returns multiple valid results and the script cannot reasonably discard all-but-one of the results, all members of the short list should be considered.

In other words, multiple results == multiple feeds to the Store module.

For this to work correctly, each result must have a unique “tag” in the Store, so that collisions are avoided.

Files

In essence an off-line version of Websites without the requirement to first find the appropriate page.

Sources such as:

Copies of webpages
http://hairy.geek.nz/epg/latest.xml.gz

User-input

Generally something typed or copy'n'pasted from a website into the script.¹⁾

Survey

Examines names of existing files and determines/estimates metadata such as:

Series name
Season number²⁾
Episode number
Episode title³⁾
Correct CRC32⁴⁾
Release group
Audio Language
Subtitle language
File extension

Usage

This module should be used to inform the Procure and Suggestor modules; it must therefore be run before either.

Considerations

Should this module use a similar approach to the Store, where all data extracted is stored (and tagged by its source) so that another module can analyse all the sources to pick the best one, or mix-and-match sources as needed?

If yes, then it ought to also provide a measure of “confidence”. This allows contextual knowledge to be translated into context-free data. For example, consider the episode number “101” in the following two filenames:

MySeries.S01.E101.avi
MySeries.101.avi

In the first case the confidence that “101” is really the episode number is very high. However, in the second case “101” might actually mean “1×01”, so the confidence is low.

In such a case the onus to determine if 101 == 1×01 would be on the Suggestor module which determines the “best” set of filename changes for the user.
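To make the confidence idea concrete, here is a small sketch that handles only the two filenames above. The regular expressions and the confidence values (0.9 and 0.4) are illustrative guesses, not tuned numbers, and survey_episode is a hypothetical name.

```python
import re

# Explicit markers such as "S01.E101" or "S01E01".
EXPLICIT = re.compile(r"S(\d+)\.?E(\d+)", re.IGNORECASE)
# A bare run of exactly three digits, e.g. ".101.".
BARE = re.compile(r"(?<!\d)(\d{3})(?!\d)")


def survey_episode(filename):
    """Return every plausible (season, episode) reading with a confidence."""
    explicit = EXPLICIT.search(filename)
    if explicit:
        return [{
            "season": int(explicit.group(1)),
            "episode": int(explicit.group(2)),
            "confidence": 0.9,
        }]
    bare = BARE.search(filename)
    if bare:
        digits = bare.group(1)
        # Ambiguous: "101" may be episode 101, or season 1 episode 01.
        return [
            {"season": None, "episode": int(digits), "confidence": 0.4},
            {"season": int(digits[0]), "episode": int(digits[1:]),
             "confidence": 0.4},
        ]
    return []
```

Returning both readings of the ambiguous case, rather than guessing, leaves the decision with the Suggestor as described above.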

Name constructors

Generates filenames from episode data in the Store w.r.t user-preferences.

At its most basic, this module implements a very simple templating system which allows the user to specify how proposed filenames are generated.

This templating system will obsolete many of v2's options, such as:

–scheme
–group
–nogroup
–nogap
–gap=…
–include_series
–exclude_series
…

Because the Preferences system allows users to have multiple profiles, it should be practical to have different naming schemes for different types of series.

The module will attempt to understand the user's environment and use an appropriate profile: for instance if the complete path for the directory under consideration contains the word “anime” it will check for a profile called “auto_anime”.

To avoid name collisions, all auto-applied profiles should be prefixed with “auto_”. Some sensible default profiles will be distributed with the script, but will always be _optional_. It is not an error if a profile cannot be loaded.
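A sketch of how the auto-profile lookup might work, assuming the caller passes the complete path as a string. Only the “anime” keyword comes from the text above; the function names and the loader callable are hypothetical.

```python
# Keywords searched for in the path. Only "anime" comes from the text above;
# treating this as an extensible list is an assumption.
AUTO_KEYWORDS = ["anime"]


def auto_profile_names(directory):
    """Return the "auto_" profile names worth trying for a directory path."""
    path = str(directory).lower()
    return ["auto_" + keyword for keyword in AUTO_KEYWORDS if keyword in path]


def load_optional_profile(name, loader):
    """Load a profile via `loader`, treating a missing profile as empty."""
    try:
        return loader(name)
    except FileNotFoundError:
        # Auto-applied profiles are optional, so this is not an error.
        return {}
```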

Example templates

Unless I find good reason not to, I intend to simply “eval” a Python string substitution.

What are the implications of evaluating user-input as code for the EXE version? Is it practical?

# Template for "Series - 1x01 - Episode title.avi"
template1 = "%(series_name)s - %(season_number)dx%(episode_number)02d - %(episode_title)s"

# Template for "S01E01 Episode title.mkv"
# ("02d" zero-pads to two digits; a plain "2d" would pad with a space)
template2 = "S%(season_number)02dE%(episode_number)02d %(episode_title)s"

# And used in the script like so:
data = dict()
data["series_name"] = "Test Series"
data["season_number"] = 1
data["episode_number"] = 8
data["episode_title"] = "Test Episode Title"

# Note, this doesn't account for the file extension, but that should be trivial.
newfilename1 = template1 % data  # "Test Series - 1x08 - Test Episode Title"
newfilename2 = template2 % data  # "S01E08 Test Episode Title"

Note that the file extension is *not* specified in the template; it is implicitly added to the end of the string produced.

The script cannot ensure that all episode numbers are the same number of digits using this approach.

If the user does not specify a number of digits, they may get output like this:

...
Series 9 EpTitle.ext
Series 10 Eptitle.ext
...
Series 99 Eptitle.ext
Series 100 Eptitle.ext
...

And if they specify 2, which appears to be the consensus, they would get:

...
Series 09 EpTitle.ext
Series 10 Eptitle.ext
...
Series 99 Eptitle.ext
Series 100 Eptitle.ext
...

We will work around this by supplying a “padded_episode_number_as_string”, which will be a string, not a number.

We will need to provide a preference which allows the user to specify a minimum number of digits when expressing episode number, which will only affect the string version.

TODO: Check if the printf syntax allows us to use a variable to specify the padding. Would a two-pass approach work? (“%%0%dd” % 2 → “%02d” % 0 → “00”). And if it did, would it confuse users?
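For what it's worth, Python's printf-style formatting does accept a variable width via “*”, but only when the right-hand side is a tuple, not a dictionary, so it cannot be used directly inside the %(name)s templates. A sketch of both routes, plus the pre-padded string fed into a template (the helper names are hypothetical):

```python
def padded(number, min_digits=2):
    """Zero-pad using a "*" width, which is read from the argument tuple."""
    return "%0*d" % (min_digits, number)


def padded_by_two_passes(number, min_digits=2):
    """Zero-pad by first building the format string, then applying it."""
    fmt = "%%0%dd" % min_digits  # e.g. "%02d"
    return fmt % number


def episode_width(episode_numbers, min_digits=2):
    """Pick a width wide enough for every episode in the series."""
    return max([min_digits] + [len(str(n)) for n in episode_numbers])


width = episode_width([9, 10, 99, 100])
data = {
    "series_name": "Series",
    "episode_number": 9,
    "padded_episode_number_as_string": padded(9, width),
}
template = "%(series_name)s %(padded_episode_number_as_string)s"
new_name = template % data  # "Series 009"
```

Computing the width from the whole series, rather than per file, is what keeps every name in a directory the same length.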

Suggestor

Considers possible changesets, selects the “best” ones and asks the user to pick one.

Initially, this module will generate a change-set for each data-source that populated the store, using the active profile w.r.t name construction. However, future versions will extend this behaviour to consider alternative profiles (in an attempt to minimise the number of files that need to be renamed) and merging multiple sources (in rare cases where some sources have “holes” in episode data).

Ranking heuristic

(This is the initial plan, I expect to extend this in future versions)

Conditions for success (can be relaxed by user intervention):

All files are accounted for
- For all files under consideration there exists relevant data in the Store

“Gentle-touch” measurement:

How many renamings does this change-set propose? Fewer is better.
- If we assume that most files under consideration have been renamed by a previous run of the script (and, implicitly, that the non-conformant filenames are new additions since that last run) then fewer changes is a good indicator that we're using the same source as last time, and the same profile⁵⁾.
- This number will always be greater than 0, or there would be no change-set.
Are the files to be renamed numbered consecutively episode-wise? True is preferred.
- Given partially renamed files as above, this would provide further affirmation that the update is sensible.
Is it mostly later episodes that need files renamed? True is better.
- Again for the partially-renamed use-case. If only later episodes are to be renamed, then the change is probably sensible.
- Likewise, when websites update their listings they're most likely to correct the most recent additions; which will tend to be the latest episodes.
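The three “gentle-touch” measurements could be computed along these lines. This is only a sketch: the change-set representation (a list of (episode_number, old_name, new_name) tuples covering all files under consideration), the use of the median to define “later”, and the priority order in rank_key are all assumptions.

```python
def measure(changeset):
    """Compute the gentle-touch measurements for one change-set."""
    episodes = sorted(ep for ep, _old, _new in changeset)
    renamed = sorted(ep for ep, old, new in changeset if old != new)
    if not renamed:
        return None  # no renamings means there is no change-set to rank
    # Consecutive: the renamed episodes form an unbroken run.
    consecutive = renamed == list(range(renamed[0], renamed[-1] + 1))
    # "Later" is taken to mean at or above the median episode considered.
    median = episodes[len(episodes) // 2]
    later = sum(1 for ep in renamed if ep >= median)
    return {
        "renames": len(renamed),
        "consecutive": consecutive,
        "mostly_later": later * 2 > len(renamed),
    }


def rank_key(measurement):
    """Sort key: fewer renames first, then consecutive, then mostly-later."""
    return (
        measurement["renames"],
        not measurement["consecutive"],
        not measurement["mostly_later"],
    )
```

Sorting the candidate change-sets with sorted(measurements, key=rank_key) would then put the gentlest one first.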

User-interface

Interact with the user.

Will be a CLI initially, potentially a GUI later.

This will require hooks in other parts of the code for best effect: eventually there will be a flashy OpenGL interface showing the internals at work. Using Python's standard logging module to instrument the code is probably the right way to do this - it is a useful debugging aid anyway.
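A sketch of how the standard logging module could provide those hooks: the other modules log as usual, and the user interface attaches its own handler to receive the messages. The logger name “tvrenamer.procure” and the ListHandler class are made up for illustration.

```python
import logging

# Each module would log under its own name (this one is hypothetical).
log = logging.getLogger("tvrenamer.procure")


class ListHandler(logging.Handler):
    """Collects log messages so a user interface can display them later."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


ui_feed = ListHandler()
log.addHandler(ui_feed)
log.setLevel(logging.INFO)

# Somewhere inside the instrumented code:
log.info("Fetched %d episodes from %s", 12, "epguides")
```

A CLI could simply print ui_feed.messages, while a later GUI could swap in a handler that animates them, without any change to the instrumented modules.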

Preferences

Create, manage and utilise user-preferences.

Stores only the program options which have been modified by the user. Therefore files are “sparse” and contain holes in the option dictionary. The Preferences module will use defaults whenever a hole is encountered.

This permits multiple layers of preferences to exist, with each layer calling to the one below it when a hole is encountered. The lowest-layer of any preference stack will always be the hard-coded defaults.

When preferences are written to disk, the resulting file is called a “profile”. Some profile names will be reserved for the script's use, to allow for common use-cases which the defaults may not be suitable for, such as “anime”⁶⁾.

Notes

There will be a global instance of Preferences and another instance per Main. In other words, each Main will look for a set of preferences specific to the directory under consideration. Therefore a run might look like this:

Script starts
Default preferences are initialised from hard-coded values
Global preferences loaded from default profile: ~/.tvrenamerrc
“Main” instance created
“Main” detects that current directory is anime
“Main” attempts to load anime preferences: ~/.tvrenamerrc.anime
“Main” attempts to load per-directory preferences: ./.tvrenamerrc
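The layered lookup described above maps quite naturally onto collections.ChainMap from the standard library, which searches a list of dictionaries in order and so falls through any “holes”. A sketch, with made-up option names and values:

```python
from collections import ChainMap

# Hard-coded defaults: the lowest layer, which must define every option.
DEFAULTS = {"min_episode_digits": 2, "keep_group_tags": False}

# Sparse layers, as they might be loaded from profiles (contents are made up).
global_prefs = {"min_episode_digits": 3}  # ~/.tvrenamerrc
anime_prefs = {"keep_group_tags": True}   # ~/.tvrenamerrc.anime
directory_prefs = {}                      # ./.tvrenamerrc (absent or empty)

# Lookups try each layer in turn, most specific first.
prefs = ChainMap(directory_prefs, anime_prefs, global_prefs, DEFAULTS)

digits = prefs["min_episode_digits"]   # 3, from the global profile
keep_tags = prefs["keep_group_tags"]   # True, from the anime profile
```

Writes to a ChainMap go to the first mapping only, which would also keep any newly saved per-directory profile sparse.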

==

I've got a list of things which I want to add to the script, but it's gotten a bit crusty since I first wrote it. It's about time for a big rewrite, and to get it right I want to use this page to work out exactly what problems I am trying to solve.

(this page has good UML examples of using dot)

<format dot> digraph G {

      node [
              shape = "record"
      ]

      edge [
      ]

      exec -> init_a
      init_d -> core_a

      subgraph clusterInit {
              label = "Init"
              init_a [label="Detect system capabilities"]
              init_b [label="Detect context (cwd / anime / tv / etc)"]
              init_c [label="Read in preference"]
              init_d [label="Parse command line options"]
              init_a -> init_b -> init_c -> init_d
      }

      subgraph clusterCore {
              label = "Core"
              core_a [label="Filename parser"]
              core_b [label="New name constructor"]
              core_c [label="Selection (interactive)"]
              core_d [label="Perform changes + checks"]
              core_a -> core_b -> core_c -> core_d
      }
} </format>

¹⁾

This was in fact the only use-case for v1 of the script…

²⁾

New format: SXX suffixes, à la this post

³⁾

To allow missing season + episode numbers to be added

⁴⁾

Often anime downloads contain the CRC32 checksum in the filename, so you can verify that the file's CRC32 matches that in the filename

⁵⁾

Eventually I want the script to try alternative profiles tentatively, and rely on this heuristic to weed out silly ones

⁶⁾

For instance, anime collectors often wish to retain the group tags that indicate the fan-subbing group, but wish to discard group tags for general TV

RobMeerman.co.uk
