GSoC 2024 ‐ Lucene Search Backend Integration

Loay Ghreeb edited this page Oct 6, 2024 · 1 revision
Student: Loay Ghreeb
Organization: JabRef e.V.
Primary repository: JabRef/Jabref
Project name: Lucene Search Backend Integration
Project mentors: @koppor, @calixtus, @HoussemNasri
Project page: Google Summer of Code 2024 Project Page
Status: Complete

Project summary

JabRef offers extensive search functionality that lets users search for entries in their libraries. This project integrates Lucene, a powerful full-text search engine, to improve search performance and enhance the user experience.

Lucene provides an intuitive and standardized search syntax which is widely used across various applications.

The integration of Lucene introduces advanced search capabilities, including:

  • Fuzzy searches (finding words with similar spelling)
  • Proximity searches (finding words within a specific distance)
  • Range searches (finding numeric or date values within a range)
  • Term boosting (prioritizing certain search terms)

These features allow users to find relevant entries more effectively.
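
As a rough illustration of the four features listed above, the following sketch collects one example query per feature, written in Lucene's classic query syntax. The field names (`title`, `author`, `year`) and the search terms are assumptions chosen for illustration, not JabRef's actual index schema, and how a range query is evaluated depends on how the field is indexed.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: field names and search terms are assumptions,
// not taken from JabRef's index schema.
final class LuceneSyntaxExamples {

    // Maps each feature to an example query in Lucene's classic query syntax.
    static Map<String, String> examples() {
        Map<String, String> queries = new LinkedHashMap<>();
        // "~" after a single term requests a fuzzy match (similar spelling).
        queries.put("fuzzy", "title:recieve~");
        // "~N" after a quoted phrase: the words must occur within N positions.
        queries.put("proximity", "title:\"neural translation\"~5");
        // "[a TO b]" is a range that includes both endpoints.
        queries.put("range", "year:[2015 TO 2020]");
        // "^N" boosts a term's contribution to the relevance score.
        queries.put("boost", "title:lucene^2 OR author:lucene");
        return queries;
    }

    public static void main(String[] args) {
        examples().forEach((feature, query) -> System.out.println(feature + ": " + query));
    }
}
```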

Beyond improving the user experience, the migration to Lucene introduces an abstract search layer for JabRef. By decoupling the search functionality from JavaFX, the project creates a modular architecture that allows for the easy integration of different search engines in the future with minimal coding effort. This flexibility ensures that JabRef's search infrastructure can evolve and adapt as needed.


Pull requests to main branch

#11542 — Lucene Search Backend

This PR is the primary deliverable of the project. It introduces the Lucene search backend to JabRef, replacing the previous search backend. The implementation includes:

  • Core Search Functionality:

    • Indexed all bibliographic fields and implemented real-time indexing updates to reflect changes in the library.
    • Replaced the old search syntax with Lucene search syntax, enabling advanced search capabilities.
    • Added search scoring and sorting, with a new search score column for ranking results by relevance.
    • Added support for searching LaTeX-encoded special characters using either Unicode or LaTeX encoding.
    • Resolved issues with search result highlighting in the entry preview and BibTeX source tab, including incorrect highlighting of boolean operators and broken highlighting with complex searches.
  • Full-text Search Improvements:

  • Search Groups Improvements:

    • Migrated free-search expression groups to use Lucene syntax, and implemented automatic migration from the old search syntax.
    • Improved performance by caching search group results, reducing re-evaluation time when switching between search groups.
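
The idea behind caching search group results can be sketched as follows. This is a minimal, single-threaded illustration and not the code from the PR: the class name `SearchGroupCache`, its methods, and the choice to clear the whole cache on invalidation are all assumptions.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch: names are illustrative, not JabRef's API.
final class SearchGroupCache {
    // Group query -> IDs of the entries that matched it.
    private final Map<String, Set<String>> cache = new HashMap<>();
    // The expensive operation being cached (for example, an index search).
    private final Function<String, Set<String>> searcher;
    private int evaluations = 0;

    SearchGroupCache(Function<String, Set<String>> searcher) {
        this.searcher = searcher;
    }

    // Runs the search only on the first request for a given query.
    Set<String> matches(String query) {
        return cache.computeIfAbsent(query, q -> {
            evaluations++;
            return Set.copyOf(searcher.apply(q));
        });
    }

    // Assumption: cached results are dropped whenever the library changes.
    void invalidate() {
        cache.clear();
    }

    int evaluations() {
        return evaluations;
    }

    public static void main(String[] args) {
        SearchGroupCache cache = new SearchGroupCache(q -> Set.of("entry1", "entry2"));
        cache.matches("author:smith");
        cache.matches("author:smith"); // served from the cache
        System.out.println(cache.evaluations()); // prints 1
        cache.invalidate();
        cache.matches("author:smith"); // evaluated again after invalidation
        System.out.println(cache.evaluations()); // prints 2
    }
}
```

Switching back to a previously selected group then costs a map lookup instead of a new search.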

#11542 — Search floating mode

This PR reintroduces floating mode, which was removed in 2017 when JabRef migrated to JavaFX. Floating mode allows more flexible display options in the main table.

  • Floating Mode:

    • Reintroduced floating mode in the main table, allowing users to switch between groups without filtering the table.
    • In floating mode, entries from selected groups are displayed at the top with a default color, while non-group entries are displayed in a different color at the end of the table.
    • Added two new toggles to control entry visibility:
      • A toggle to filter selected groups: When enabled and a group is selected, only entries matching the group are shown.
      • A toggle to filter search results: When enabled and there is an active search query, only entries matching the search are shown.
  • Entry Ranking and Sorting:

    • In floating mode, entries are categorized and ranked based on their match to both search terms and selected groups:
      • Rank 1: Matches both the search term and selected groups.
      • Rank 2: Matches the search term only.
      • Rank 3: Matches the selected groups only.
      • Rank 4: Matches neither.
    • Each rank has a custom CSS style for easy visual distinction.
    • Added keyboard shortcuts (right and left arrows) to navigate between ranks.
    • New toggle to invert selected group(s), displaying entries that do not belong to the selected group(s).
  • Performance Improvements:

    • Performed searches in a background thread and updated the table on the UI thread, so the UI is not blocked while a search is running.
    • Improved performance when switching between groups with an active search query; now the search is executed only once when the search query changes, rather than with each group switch.
    • Improved performance when opening multiple libraries, as updates to search terms and group selection affect only the current library rather than all open libraries.
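
The four-rank scheme described under Entry Ranking and Sorting can be sketched as a small enum plus a sort. This is an illustration under stated assumptions rather than JabRef's implementation: `MatchRank`, `RankedEntry`, and `FloatingModeRankDemo` are hypothetical names, and only the rank numbers and their meanings come from the list above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical demo class; all names in this sketch are illustrative.
final class FloatingModeRankDemo {

    // Returns a copy of the entries ordered by rank (rank 1 first).
    // List.sort is stable, so entries of equal rank keep their original order.
    static List<RankedEntry> sortByRank(List<RankedEntry> entries) {
        List<RankedEntry> sorted = new ArrayList<>(entries);
        sorted.sort(Comparator.comparingInt((RankedEntry e) -> e.rank().value));
        return sorted;
    }

    public static void main(String[] args) {
        List<RankedEntry> entries = List.of(
                new RankedEntry("neither", false, false),
                new RankedEntry("groupOnly", false, true),
                new RankedEntry("both", true, true),
                new RankedEntry("searchOnly", true, false));
        for (RankedEntry e : sortByRank(entries)) {
            System.out.println(e.rank().value + " " + e.citationKey());
        }
    }
}

// The four ranks from the list above.
enum MatchRank {
    SEARCH_AND_GROUP(1), SEARCH_ONLY(2), GROUP_ONLY(3), NONE(4);

    final int value;

    MatchRank(int value) {
        this.value = value;
    }

    static MatchRank of(boolean matchesSearch, boolean matchesGroup) {
        if (matchesSearch && matchesGroup) {
            return SEARCH_AND_GROUP;
        }
        if (matchesSearch) {
            return SEARCH_ONLY;
        }
        if (matchesGroup) {
            return GROUP_ONLY;
        }
        return NONE;
    }
}

// Minimal stand-in for a library entry with its two match flags.
record RankedEntry(String citationKey, boolean matchesSearch, boolean matchesGroup) {
    MatchRank rank() {
        return MatchRank.of(matchesSearch, matchesGroup);
    }
}
```

Because the rank is derived from two booleans, a per-rank style or a shortcut that jumps between ranks only needs the rank value.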

#11579 — Update main table colors

  • Introduced new CSS styles for entries in the main table when using floating mode, applying distinct colors for each rank.

#11658 — Highlight search bar on wrong query syntax

  • Highlights the search bar in red when an invalid query is entered, providing immediate visual feedback to users.

#11719 — Update lucene version

  • Updates the Lucene version as a follow-up to the Lucene search PR. This keeps the internally tracked version up to date, which is needed in case a change to the index configuration requires an index rebuild.

#11749 — Fix search test NPE

#11624 — Fix exception when removing an entry from the selected group

#11625 — Fix invalid regex query throws an exception

#11626 — Remove "Match category" from localization

#11659 — Update PreviewView to listen for library search queries

#11446 — Fix context menu of the search bar

Pull requests related to project in other repositories

#5 — Fix MappedBackedList when the change is wasUpdated

#1519 — 8301761: The sorting of the SortedList can become invalid


Screenshots

Floating mode

Light Theme Floating_Mode_Light_Theme
Dark Theme Floating_Mode_Dark_Theme

Search results highlighting

Preview tab

Highlighting_Preview_Tab

Source tab

Highlighting_Source_Tab

Search bar with invalid query

Search_bar


Additional pull requests to improve searching in JabRef before and during GSoC

#11293 — Remove outdated pdf indexed files from Lucene index

#11301 — Remove EnglishStemAnalyzer

#11317 — Register index update listener

#11402 — Search result mapping use UIDs instead of hashcode of BibDatabaseContext

#11432 — Improve document viewer

#11032 — Use a separate search bar in the global search window

#11172 — Set the initial global search query to the current normal search query

#11010 — Selected entry in search window selects entry in the main table

#11268 — Update search results when active database changes


Challenges and important things I learned during the project

Throughout my GSoC project, I encountered various challenges and gained valuable experience across multiple aspects of software development and open-source contribution. These experiences have enhanced my skills as a developer and my understanding of large-scale software projects.

  • Open source contribution: Gained a deep understanding of the open-source contribution process, including submitting pull requests, reviewing code, and collaborating with maintainers and contributors. This experience helped me become familiar with best practices for contributing to large, active open-source projects.
  • Working with a large codebase: Gained experience working with a large codebase, navigating its modules, and understanding the architecture of the application.
  • JavaFX and MVVM architecture: Gained practical experience with JavaFX and the MVVM architecture, which is used extensively in JabRef.
  • Resolving large conflicts: One of the major challenges I faced was resolving large merge conflicts. I started from an existing PR that had been opened two years earlier, and the codebase had evolved since then. I learned how to handle these merge conflicts and bring the old code in line with the current architecture.
  • Search engines: Gained a good understanding of search engines, particularly Apache Lucene, including concepts like full-text search, inverted indexes, analyzers, tokenizers, query parsing, and scoring.
  • Multithreading and performance optimization: Learned how to separate the searching and indexing processes from the UI thread to enhance performance, avoiding UI freezes during these operations.
  • ANTLR and visitor pattern: To migrate JabRef's old ANTLR custom search syntax to Lucene query syntax, I had to learn about ANTLR and the visitor pattern. These were completely new concepts for me, and I gained fundamental knowledge about how ANTLR is used for parsing complex language constructs.
  • Debugging: Gained significant experience in identifying and fixing bugs, both within JabRef's codebase and external libraries.
  • Testing: Gained hands-on experience writing and maintaining test cases using JUnit, including implementing unit tests for search migrations and conducting end-to-end tests to ensure the search functionality worked as expected.
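
To make the visitor pattern mentioned above concrete, here is a toy sketch of translating a small expression tree into a Lucene-style query string. It is not the migration code from the project: the node types are hand-written rather than generated by ANTLR, the grammar is far simpler than JabRef's old search syntax, and every name (`SearchNode`, `SearchVisitor`, `LuceneQueryVisitor`, and so on) is hypothetical.

```java
// Hypothetical demo class; all names in this sketch are illustrative.
final class SearchMigrationDemo {
    public static void main(String[] args) {
        // Stands in for a parsed old-style expression such as
        // "author = smith and not year = 2020" (parsing itself is omitted).
        SearchNode tree = new And(
                new Comparison("author", "smith"),
                new Not(new Comparison("year", "2020")));
        System.out.println(tree.accept(new LuceneQueryVisitor()));
        // prints (author:smith AND NOT year:2020)
    }
}

// Every node accepts a visitor and dispatches to the matching visit method.
interface SearchNode {
    <T> T accept(SearchVisitor<T> visitor);
}

// One visit method per node type; T is the result of the traversal.
interface SearchVisitor<T> {
    T visitComparison(Comparison node);
    T visitAnd(And node);
    T visitOr(Or node);
    T visitNot(Not node);
}

record Comparison(String field, String value) implements SearchNode {
    public <T> T accept(SearchVisitor<T> visitor) {
        return visitor.visitComparison(this);
    }
}

record And(SearchNode left, SearchNode right) implements SearchNode {
    public <T> T accept(SearchVisitor<T> visitor) {
        return visitor.visitAnd(this);
    }
}

record Or(SearchNode left, SearchNode right) implements SearchNode {
    public <T> T accept(SearchVisitor<T> visitor) {
        return visitor.visitOr(this);
    }
}

record Not(SearchNode operand) implements SearchNode {
    public <T> T accept(SearchVisitor<T> visitor) {
        return visitor.visitNot(this);
    }
}

// Builds the query string bottom-up by visiting each node of the tree.
final class LuceneQueryVisitor implements SearchVisitor<String> {
    public String visitComparison(Comparison node) {
        return node.field() + ":" + node.value();
    }

    public String visitAnd(And node) {
        return "(" + node.left().accept(this) + " AND " + node.right().accept(this) + ")";
    }

    public String visitOr(Or node) {
        return "(" + node.left().accept(this) + " OR " + node.right().accept(this) + ")";
    }

    public String visitNot(Not node) {
        return "NOT " + node.operand().accept(this);
    }
}
```

The translation logic lives entirely in the visitor, so the tree classes stay unchanged if a different target syntax is needed.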

Special thanks to my mentors @koppor and @calixtus for their guidance and support throughout the project. Without their mentorship, I wouldn't have been able to successfully complete the project.

