Modify the file converter API to support retrieving file metadata, either synchronously or asynchronously. When dealing with files, there are cases where we want to submit a file for analysis to extract data or metadata from it. This is different to converting a file from one form to another.
Example use cases are using an image recognition service to extract data about the content of an image, or extracting the raw text from a PDF file. The Elasticsearch Global search plugin uses both of these examples when indexing documents for Global Search.
Using the current file converter workflow to do this would look like:
- Submit file to converter
- Converter converts file
- Converted file is saved as a new file on the filesystem (e.g. as JSON)
- File content is read
- Converted file is removed
This seems inefficient when all we want is the metadata about a file or the file contents. There is a lot of file I/O which shouldn't be required, as there is no need to have a second "converted" copy of the file. Also, depending on the use case, this needs to have the option of being synchronous.
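To make the round trip concrete, here is a minimal sketch of the steps above using plain PHP file functions. It does not use the real converter API: `convert_to_file()` and `get_metadata_via_conversion()` are hypothetical stand-ins, and the only "metadata" extracted is the file size.

```php
<?php
// Hypothetical stand-in for a converter that can only deliver its result as a file.
function convert_to_file(string $sourcepath, string $format): string {
    if ($format !== 'json') {
        throw new InvalidArgumentException("Unsupported format: $format");
    }
    $destpath = tempnam(sys_get_temp_dir(), 'conv');
    // Write the "converted" copy to disk, as the current workflow requires.
    file_put_contents($destpath, json_encode(['size' => filesize($sourcepath)]));
    return $destpath;
}

// The round trip: convert, save as a new file, read it back, remove it.
function get_metadata_via_conversion(string $sourcepath): array {
    $destpath = convert_to_file($sourcepath, 'json');
    $json = file_get_contents($destpath); // Second read, only to recover the data.
    unlink($destpath);                    // The converted copy was never needed.
    return json_decode($json, true);
}

$source = tempnam(sys_get_temp_dir(), 'src');
file_put_contents($source, 'hello world');
$metadata = get_metadata_via_conversion($source);
unlink($source);
```

Every call pays for one extra write, one extra read and one delete, purely to move a small JSON string between two functions.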
It would be better to have a call like:
- $converter->get_metadata($file, $format, $async)
- $converter->get_content($file, $format, $async) (this is different to $file->get_content, which simply reads the file contents as a string with no conversion)
Each converter would then declare which file types it supports extraction on, and which formats it can return data in (JSON, XML, etc.)
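One possible shape for this, as a rough sketch only (PHP 8.0+). Nothing here exists in the converter API yet: the `extractor` interface, the `get_supported_*` methods, the `sketch_file` stand-in and the in-memory queue used for the asynchronous case are all assumptions made for illustration.

```php
<?php
// Hypothetical stand-in for a stored file; only the two methods used below.
class sketch_file {
    public function __construct(private string $mimetype, private string $content) {}
    public function get_mimetype(): string { return $this->mimetype; }
    public function get_content(): string { return $this->content; }
}

// Assumed shape of the proposed extraction calls.
interface extractor {
    public function get_supported_mimetypes(): array;
    public function get_supported_formats(): array;
    // Both calls return the extracted data, or null when the work was deferred.
    public function get_metadata(sketch_file $file, string $format, bool $async = false): ?string;
    public function get_content(sketch_file $file, string $format, bool $async = false): ?string;
}

// Toy converter that only handles plain text and only returns JSON.
class plain_text_extractor implements extractor {
    /** @var array[] Requests deferred by asynchronous calls. */
    public array $queue = [];

    public function get_supported_mimetypes(): array { return ['text/plain']; }
    public function get_supported_formats(): array { return ['json']; }

    public function get_metadata(sketch_file $file, string $format, bool $async = false): ?string {
        $this->check($file, $format);
        if ($async) {
            $this->queue[] = ['metadata', $file, $format]; // Processed later, e.g. by a task.
            return null;
        }
        $text = $file->get_content();
        return json_encode(['bytes' => strlen($text), 'words' => str_word_count($text)]);
    }

    public function get_content(sketch_file $file, string $format, bool $async = false): ?string {
        $this->check($file, $format);
        if ($async) {
            $this->queue[] = ['content', $file, $format];
            return null;
        }
        return json_encode(['text' => $file->get_content()]);
    }

    // Reject file types and formats this converter has not declared support for.
    private function check(sketch_file $file, string $format): void {
        if (!in_array($file->get_mimetype(), $this->get_supported_mimetypes(), true)
                || !in_array($format, $this->get_supported_formats(), true)) {
            throw new InvalidArgumentException('Unsupported file type or format');
        }
    }
}

$extractor = new plain_text_extractor();
$file = new sketch_file('text/plain', 'hello file converter');
$metadata = $extractor->get_metadata($file, 'json');       // Synchronous: data returned directly.
$deferred = $extractor->get_metadata($file, 'json', true); // Asynchronous: queued, nothing returned yet.
```

In the synchronous case the data comes straight back to the caller and no second copy of the file is written. How an asynchronous result would be delivered (callback, task, polling) is left open here.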