Mac OS X has a real-time file indexing system called Spotlight. By itself, it's not terribly useful for me. What makes it useful is that it's accessible from the command line. I can search filenames:

locate mapgen2.as
mdfind -name mapgen2.as

The locate command relies on a database that's rebuilt nightly; mdfind queries the Spotlight index, which is updated in real time. Spotlight also indexes file contents, which gives me something similar to recursive fgrep:

fgrep -R word *
mdfind -onlyin . word

It won't search for arbitrary regexps, but for quickly narrowing down which files contain certain words, it's useful.
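
One way to get regexps back is to let Spotlight narrow down the file list by word and then run grep over just those files. This is a minimal sketch, not something from the original setup: mdgrep is a hypothetical helper name, and it assumes mdfind's -0 option (NUL-terminated paths) is available.

```shell
# mdgrep WORD REGEX [DIR]  (hypothetical helper, not a built-in command)
# Ask Spotlight for files under DIR that contain WORD, then search only
# those files with a real extended regexp.
mdgrep() {
  local word="$1" regex="$2" dir="${3:-.}"
  # -0 makes mdfind emit NUL-terminated paths, so names with spaces survive.
  # /dev/null is passed as an extra file so grep always prints filenames and
  # has something harmless to read if Spotlight returns no results.
  mdfind -0 -onlyin "$dir" "$word" | xargs -0 grep -nE -- "$regex" /dev/null
}
```

Something like mdgrep noise 'noise[0-9]+\(' ~/Projects would then only pay the regexp cost on files Spotlight already knows contain the word.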

The way Mac OS works, though, to index a file the OS needs to know which application knows how to interpret it. This makes sense for the binary application-specific files that most people deal with, but I mostly work with text files. This post explains how to get Spotlight to index source code: Mac OS comes with a "rich text indexer", and you can tell it to include source code.

Update: [2014-06-06] I switched to a different way to do this that I like better.

Time for a detour. Mac OS uses "Uniform Type Identifiers" to mark the type of a file. These are things like public.html or com.apple.quicktime-movie. These types show up as kMDItemContentType in Spotlight's index. Mac OS can also use file extensions, MIME types, or old Mac "kinds" to determine the type of a file. File extensions like .html somehow map to public.html but I don't quite understand where that happens. As far as I can tell, file extensions and type identifiers are declared by GUI applications. You can also use the command line tool duti to map file extensions to applications, or type identifiers to applications. I don't know how to tell the system to make files with a particular extension have a type identifier that I give it.
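
To see which type identifier Spotlight has recorded for particular files, or to go the other way and list files of a given type, something like the following may help. It is only a sketch: the two function names are made up for illustration, and it assumes the kMDItemContentType attribute can be used in an mdfind query as well as read with mdls.

```shell
# content_types FILE...  (hypothetical helper)
# Print "type-identifier<TAB>path" for each file.
content_types() {
  local f
  for f in "$@"; do
    printf '%s\t%s\n' "$(mdls -raw -name kMDItemContentType "$f")" "$f"
  done
}

# files_of_type TYPE [DIR]  (hypothetical helper)
# List files under DIR whose recorded content type is exactly TYPE.
files_of_type() {
  mdfind -onlyin "${2:-.}" "kMDItemContentType == '$1'"
}
```

For example, content_types index.html notes.org prints one line per file, and files_of_type public.html ~/Sites lists the HTML files Spotlight knows about under that directory.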

To tell the rich text indexer what files to index, I need to add the type identifiers (not file extensions) to /System/Library/Spotlight/RichText.mdimporter/Contents/Info.plist. Some source code is recognized (somehow) by Mac OS and has type identifiers like public.python-script or com.sun.java-source. However, other files aren't recognized, and I want them indexed too. This command gives a list of everything registered:

/System/Library/Frameworks/CoreServices.framework/Versions/A/Frameworks/LaunchServices.framework/Versions/A/Support/lsregister -dump

(While looking through there I found that it pointed to lots of apps I no longer had, so I used the -kill -seed flags to rebuild that database.)
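
The dump is long, so it helps to filter it. The sketch below just pipes it through grep with some context lines. lsgrep is a hypothetical helper name, and nothing here assumes a particular layout of the dump beyond it being plain text.

```shell
# Path copied from the command above.
LSREGISTER=/System/Library/Frameworks/CoreServices.framework/Versions/A/Frameworks/LaunchServices.framework/Versions/A/Support/lsregister

# lsgrep PATTERN [CONTEXT]  (hypothetical helper)
# Show dump lines matching PATTERN, case-insensitively, with CONTEXT lines
# of surrounding output (default 5) so the enclosing entry is visible.
lsgrep() {
  "$LSREGISTER" -dump | grep -i -C "${2:-5}" -- "$1"
}
```

Running lsgrep public.source-code, for instance, shows each place that identifier appears along with the neighboring lines.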

I looked at other files with mdls -name kMDItemContentType foo.css and found that Mac OS gives these random-looking type identifiers like dyn.ah62d4rv4ge80g65x. Looking at more files, I found that the type identifiers are consistent across files with the same extension. So that means for each extension, I need to find its type identifier. Here's how I did that:

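# These extensions are text files
for ext in org md hx bxml mxml hxml dot h hxx c cc py pl rb el js ts as css scss html sh java; do
  # Find one sample file; require a literal ".ext" at the end of the path
  # so that e.g. "h" does not match "foo.sh"
  file=$(mdfind -onlyin "$HOME" -name ".$ext" | grep "\.${ext}\$" | head -1)
  # Skip extensions with no sample file, since mdls fails on an empty path
  [ -n "$file" ] || continue
  echo '          <string>'"$(mdls -raw -name kMDItemContentType "$file")"'</string><!-- '"$ext"' -->'
done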

I edited the rich text importer file and put the type identifiers into it, and then reloaded the rich text importer:

sudo vi /System/Library/Spotlight/RichText.mdimporter/Contents/Info.plist
mdimport -r /System/Library/Spotlight/RichText.mdimporter
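
Because this is a hand edit of a system file, a more cautious version of those two commands might keep a copy of the original and check the plist syntax before reloading. This is a sketch under assumptions: update_importer is a hypothetical helper, the backup location in $HOME is arbitrary, and it relies on plutil -lint to validate the file.

```shell
# update_importer IMPORTER_BUNDLE  (hypothetical helper)
# Back up the importer's Info.plist once, open it in an editor, then
# reload the importer only if the edited plist is still well-formed.
update_importer() {
  local importer="$1"
  local plist="$importer/Contents/Info.plist"
  # Arbitrary backup location; only the first (pristine) copy is kept.
  local backup="$HOME/$(basename "$importer").Info.plist.orig"
  [ -e "$backup" ] || cp "$plist" "$backup"
  sudo "${EDITOR:-vi}" "$plist"
  # plutil -lint exits non-zero on a malformed plist, which skips the reload.
  plutil -lint "$plist" && mdimport -r "$importer"
}
```

It would be called as update_importer /System/Library/Spotlight/RichText.mdimporter.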

There are some file extensions that match existing type identifiers. For example, TypeScript's .ts files show up as public.mpeg-2-transport-stream. I'm just going to ignore this problem and have them indexed as text.

This seems to work! I tested an .org file and a .java file and both were indexed.
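
A test like that can be scripted roughly as follows. This is a sketch, not the exact check used here: spotlight_indexes is a hypothetical helper, the 10-second default wait is a guess at how long indexing takes, and it assumes Spotlight treats the made-up token as a single searchable word. The token is kept out of the filename so that a hit can only come from the file's contents.

```shell
# spotlight_indexes EXT [DIR] [SECONDS]  (hypothetical helper)
# Succeeds if a freshly written file with extension EXT becomes findable
# by its contents within SECONDS (default 10, an arbitrary guess).
spotlight_indexes() {
  local ext="$1" dir="${2:-$HOME}" delay="${3:-10}"
  local token="zqindexcheck${$}x${RANDOM}"   # unlikely to appear elsewhere
  local file="$dir/indexcheck-${$}.$ext"     # name does not contain the token
  local hits
  printf 'Test file containing %s\n' "$token" > "$file"
  sleep "$delay"
  hits=$(mdfind -onlyin "$dir" "$token")
  rm -f "$file"
  [ -n "$hits" ]
}
```

For example: spotlight_indexes org && echo "org files are indexed".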

However, I think there's a better way! There's some reading on Apple's site about both file extension to type identifier mapping, and also type identifier inheritance. With type inheritance, instead of adding all the type identifiers to the rich text indexer, I'd want to make all those type identifiers inherit from public.source-code, which inherits from public.plain-text, which is indexed already. There's also public.shell-script which inherits from public.script which inherits from public.source-code. The lsregister -dump command shows what inherits from what. If I could set up new type identifiers with inheritance, they'd automatically be indexed, and it'd also work with any other parts of the system that use type identifiers. I think this page tells me how to do that, but I haven't figured out where I should put that XML. That's for another day. (Update: [2014-06-06] I did this.) Today's solution is to put all the type identifiers into the rich text importer.
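
For reference, a type declaration with inheritance looks roughly like what the sketch below prints. The keys (UTTypeIdentifier, UTTypeConformsTo, UTTypeTagSpecification) are the standard ones from Apple's documentation, which describes such entries as living in a UTImportedTypeDeclarations or UTExportedTypeDeclarations array of an application bundle's Info.plist. Which bundle to put it in is left open here. The uti_declaration helper and the example identifier are hypothetical.

```shell
# uti_declaration IDENTIFIER EXTENSION [PARENT]  (hypothetical helper)
# Print a plist <dict> declaring IDENTIFIER for files ending in .EXTENSION,
# conforming to PARENT (default public.source-code).
uti_declaration() {
  local id="$1" ext="$2" parent="${3:-public.source-code}"
  cat <<EOF
<dict>
  <key>UTTypeIdentifier</key>
  <string>$id</string>
  <key>UTTypeConformsTo</key>
  <array>
    <string>$parent</string>
  </array>
  <key>UTTypeTagSpecification</key>
  <dict>
    <key>public.filename-extension</key>
    <array>
      <string>$ext</string>
    </array>
  </dict>
</dict>
EOF
}
```

For example, uti_declaration org.example.haxe-source hx prints a declaration that would make .hx files conform to public.source-code, and passing public.shell-script as a third argument picks a different parent.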
