Bug
Resolution: Fixed
Minor
4.3.8, 4.4.4
The search indexing works as follows:
- Index all documents up to current time
- Next time indexing runs, start at the previous time indexing started
The times are inclusive, so consecutive runs overlap by one second. In theory this means every document is indexed, because the runs together cover every possible second. For example, if indexing runs at 10:00:00.00 and a document is written at 10:00:00.50 (half a second later), the document is still indexed, because the next indexing run covers documents from 10:00:00 up to whenever that run starts. (Some documents written during that second may be indexed twice. Indexing the same document twice wastes a little time but does no harm.)
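The inclusive overlap described above can be sketched as follows. This is a minimal toy model, not the real indexing code: `Doc`, `index_run`, and the integer timestamps are all made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class Doc:
    name: str
    timecreated: int  # whole seconds, as recorded on the document

def index_run(visible_docs, prev_start, now):
    """Select documents with prev_start <= timecreated <= now (both inclusive)."""
    return [d.name for d in visible_docs if prev_start <= d.timecreated <= now]

# Both documents are created during second 100.
early = Doc("early", 100)  # saved at 100.1, before run 1 queries at 100.3
late = Doc("late", 100)    # saved at 100.5, after run 1 has queried

# Run 1 starts in second 100 and only sees the document saved so far.
run1 = index_run([early], prev_start=0, now=100)
# Run 2 starts in second 160 and re-reads from the previous start time (100).
run2 = index_run([early, late], prev_start=100, now=160)

print(run1)  # ['early']
print(run2)  # ['early', 'late']
```

The second run picks up "late" because second 100 is covered again, at the cost of indexing "early" twice.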
However, it can miss documents in the following case:
- A new document is written very close to the end of a second (e.g. 09:07:09.991).
- The timecreated for this document is that second (e.g. 09:07:09) but it will not be completely saved to the database until some time into the following second (e.g. 09:07:10.027).
- Indexing runs early in the following second, before the save completes (e.g. 09:07:10.001).
In this scenario, the document is never indexed. It was saved to the database only after the indexing run had looked for documents, so that run misses it, and its timecreated is in the previous second, so the next run (which starts from 09:07:10) does not include it either.
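The race can also be reproduced with a small toy model. This is a sketch under assumptions and not the attached script: `Doc`, `committed_at`, and `index_run` are hypothetical names, and times are seconds counted from 09:07:00.

```python
from dataclasses import dataclass

@dataclass
class Doc:
    name: str
    timecreated: int     # whole second recorded on the document
    committed_at: float  # moment the row becomes visible to queries

def index_run(docs, prev_start, now):
    """Inclusive window [prev_start, int(now)], restricted to rows already committed."""
    now_sec = int(now)
    found = [d.name for d in docs
             if d.committed_at <= now and prev_start <= d.timecreated <= now_sec]
    return found, now_sec  # the next run starts from this run's start second

# Written at 09:07:09.991, timecreated 09:07:09, committed at 09:07:10.027.
page = Doc("Page 0003", timecreated=9, committed_at=10.027)

# Run 1 at 09:07:10.001: the row is not committed yet, so it is not seen.
run1, next_start = index_run([page], prev_start=0, now=10.001)
# Run 2 one minute later: the window now starts at second 10, after timecreated.
run2, _ = index_run([page], prev_start=next_start, now=70.001)

print(run1, next_start, run2)  # [] 10 []
```

Neither run returns the page, which matches the failure described above.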
I wrote a complicated script to demonstrate this (attached). Here is the real output from my script corresponding to this case:
Creating the page:
Page 0003 09:07:09.991
09:07:10.027
timemodified 09:07:09
Indexing (starts before page is written, but does not include page even though it has the earlier second):
Now 09:07:10.001
Processing area: Page
Processed 1 records containing 1 documents, in 0 seconds.
Now 09:07:10.196
(The 'Processed' record here is not the page that was just added, but one of the previous ones.)
This scenario is quite unlikely, as the indexing for that search area has to start at almost exactly the moment the document is being created, but:
- On a large site with indexing set to happen frequently (we run once per minute on ours) it could happen.
- With more complicated documents than a Page activity, committing to the database is more likely to take some time, i.e. the timecreated value could end up in the previous second even if you didn't start writing it 0.99 seconds in.
To resolve this problem without any performance cost, I propose changing indexing so that it does not include data from the most recent N seconds (e.g. 5); this data is instead covered in the next index run. That way, anything written to the database with an 'old' timecreated (up to 5 seconds old) will still be indexed. Realistically nothing should be later than that, or if it is, then it would be a bug in the activity creating the data. (E.g. if you calculate timecreated at the start of a transaction, then spend ages adding stuff to other tables, then write it to the database. In that scenario you should work out timecreated at the end.)
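A minimal sketch of the proposed delayed window, using the same kind of toy model. The names (`Doc`, `index_run`, `delay`) are hypothetical and the real change may be structured differently. It also assumes that no row is committed more than `delay` seconds after its timecreated, so the next window can begin one second after the previous window's end.

```python
from dataclasses import dataclass

@dataclass
class Doc:
    name: str
    timecreated: int     # whole second recorded on the document
    committed_at: float  # moment the row becomes visible to queries

def index_run(docs, window_start, now, delay=5):
    """Index [window_start, int(now) - delay]; newer data waits for the next run."""
    window_end = int(now) - delay
    found = [d.name for d in docs
             if d.committed_at <= now and window_start <= d.timecreated <= window_end]
    return found, window_end + 1  # next window starts just after this one

# Same race as in the bug: timecreated 09:07:09, committed at 09:07:10.027.
page = Doc("Page 0003", timecreated=9, committed_at=10.027)

# Run 1 at 09:07:10.001 only covers up to 09:07:05, so second 9 is deferred.
run1, next_start = index_run([page], window_start=0, now=10.001)
# Run 2 one minute later covers seconds 6 to 65 and finds the page.
run2, _ = index_run([page], window_start=next_start, now=70.001)

print(run1, next_start, run2)  # [] 6 ['Page 0003']
```

The late commit no longer matters, because second 09:07:09 is not indexed until a run that starts at least 5 seconds after it.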
This diagram shows the proposed change, with 'now' being the time that 'Indexing 1' runs:
In addition to the correctness improvement (for items that are added to the database slightly late), this also has a slight performance advantage: we no longer need to index the current second's worth of documents twice, because indexing only covers documents at least 5 seconds in the past, so no new documents will be added with those times. Obviously this is a very small improvement; even if you run search indexing every minute it will only save about 1/60 of the document indexing.
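As a rough check of the 1/60 figure, assuming documents arrive at a uniform rate (an assumption made only for this estimate; `duplicated_fraction` is a hypothetical helper):

```python
def duplicated_fraction(interval_seconds, overlap_seconds=1):
    """Approximate share of indexing work spent re-reading the overlapping second."""
    return overlap_seconds / interval_seconds

# Indexing once per minute with a one-second inclusive overlap.
print(f"{duplicated_fraction(60):.2%}")  # 1.67%
```

With less frequent runs the saving shrinks further, for example 1/3600 for hourly indexing.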
Test scripts (Behat and PHPUnit) rely on indexing completing immediately, i.e. the pattern is to create an activity and then immediately update the index. To avoid breaking both core and third-party automated tests, the delay will not be applied in test runs, which will continue to use the 'before' strategy. I put in a hacky $CFG variable to enable the new approach during testing, which we can use for a unit test of the new behaviour.
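The test-run exception could look roughly like the sketch below. Every name here is hypothetical (`INDEX_DELAY_SECONDS`, `effective_delay`, `force_delay_in_tests`); in particular, `force_delay_in_tests` only stands in for the $CFG variable mentioned above, whose actual name is not given here.

```python
INDEX_DELAY_SECONDS = 5  # hypothetical constant for the proposed N

def effective_delay(in_test_run: bool, force_delay_in_tests: bool = False) -> int:
    """Return the delay to apply: none in test runs unless explicitly forced."""
    if in_test_run and not force_delay_in_tests:
        return 0  # tests create an activity and index it immediately
    return INDEX_DELAY_SECONDS

print(effective_delay(in_test_run=False))                            # 5
print(effective_delay(in_test_run=True))                             # 0
print(effective_delay(in_test_run=True, force_delay_in_tests=True))  # 5
```

The third call is the case a unit test of the new behaviour would use.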