Seamless Document Management with DocrepoX: Bulk Uploading and Indexing at Scale
By Harlin
DocrepoX makes it easy to upload documents, whether you’re adding a few files or managing a much larger batch. If you just need to upload a file or two, you can simply drag and drop from your local drive. But when you’re dealing with hundreds or even thousands of documents, manually selecting files through your browser isn’t practical.
To handle large-scale uploads efficiently, DocrepoX includes a **bulk upload tool**. Instead of navigating through upload dialogs, you can send entire folders directly from the command line. This streamlines the process, saving time and reducing effort.
For example, I use DocrepoX to store all my university coursework. At the end of each semester, rather than uploading each document one by one, I use the bulk upload tool to send everything in a single command. It’s fast, efficient, and keeps my documents well-organized with minimal effort.
Setting Up for Bulk Upload
Before uploading, it’s best to have a dedicated project in DocrepoX to keep things organized. In my case, I use **University Documents** to store everything related to my coursework. If you don’t have a project yet, you’ll want to create one.
DocrepoX expects files to be placed inside its directory structure, typically under `docrepox_install/docrepox`. For this example, I’ve created a **`tmp/`** folder and copied my documents into it while keeping the structure that makes sense for me. Here’s how it looks:
(.venv) harlin@legionpro5-16irx8:docrepo $ cd tmp/
(.venv) harlin@legionpro5-16irx8:tmp $ ls
Term8
(.venv) harlin@legionpro5-16irx8:tmp $ cd Term8
(.venv) harlin@legionpro5-16irx8:Term8 $ ls
'Operating Systems' 'Software Engineering' 'World Literature'
(.venv) harlin@legionpro5-16irx8:Term8 $ ls Operating\ Systems/
Week1 Week2 Week3 Week4 Week5 Week6 Week7
(.venv) harlin@legionpro5-16irx8:Term8 $ ls Operating\ Systems/Week7
Assignment7.docx Assignment7.txt
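Before sending a large batch, it can help to take a quick inventory of what is about to be uploaded. The sketch below is not part of DocrepoX: `summarize_folder` is a hypothetical helper that only reads the local folder and counts files by extension, so you can compare the totals with what shows up in the repository afterwards.

```python
from collections import Counter
from pathlib import Path


def summarize_folder(root):
    """Count the files under root (recursively), grouped by lowercase extension.

    Hypothetical helper for a pre-upload check; it never touches DocrepoX.
    """
    root = Path(root)
    if not root.is_dir():
        raise NotADirectoryError(f"{root} is not a directory")

    counts = Counter()
    for path in root.rglob("*"):
        if path.is_file():
            # Files without a suffix are grouped under a readable label.
            counts[path.suffix.lower() or "(no extension)"] += 1
    return counts
```

Calling `summarize_folder("tmp/Term8")` from the docrepo folder would return a `Counter` keyed by extension, such as `.docx` and `.txt` for the folder shown above.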
Running the Bulk Upload
Now that the folder structure is set up, the next step is getting these files into DocrepoX efficiently. The **bulk upload tool** simplifies this process, allowing you to upload entire directories with a single command.
From the **docrepo** folder, run the following commands:
source .venv/bin/activate # Activate virtual environment (if applicable)
python manage.py upload local_folder '/ROOT/Path/To/My Folder' --owner myuser
Important Notes
- The target folder in the repository **must exist** before running the command. Otherwise, you'll see an error like this:
CommandError: Required folder '/ROOT/Projects/University Documents/Extra Folder' does not exist.
For this example, I’ll use the following command to upload my **Term 8** documents:
python manage.py upload tmp/Term8 '/ROOT/Projects/University Documents/' --owner testuser1
Once the upload begins, you should see messages confirming the creation of subfolders and the successful upload of documents:
Created subfolder: /ROOT/Projects/University Documents/World Literature
Created subfolder: /ROOT/Projects/University Documents/World Literature/Week6
Uploaded document: /ROOT/Projects/University Documents/World Literature/Week6/Discussion6.txt
Created subfolder: /ROOT/Projects/University Documents/World Literature/Week1
Uploaded document: /ROOT/Projects/University Documents/World Literature/Week1/Discussion1.txt
Created subfolder: /ROOT/Projects/University Documents/World Literature/Week2
Uploaded document: /ROOT/Projects/University Documents/World Literature/Week2/Assignment2.docx
Uploaded document: /ROOT/Projects/University Documents/World Literature/Week2/Assignment2.txt
Uploaded document: /ROOT/Projects/University Documents/World Literature/Week2/Assignment2.odt
...
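If you run the upload from a script, a small wrapper can build the command and summarize what the tool reported. This is only a sketch: `build_upload_command` and `tally_upload_output` are hypothetical names, the argument order is copied from the command above, and the tally relies on the two message prefixes shown in the sample output. I am also assuming those messages can be captured as text by the calling script.

```python
import shlex


def build_upload_command(local_folder, target_folder, owner):
    """Return the argument list for the upload command shown above."""
    return [
        "python", "manage.py", "upload",
        local_folder, target_folder,
        "--owner", owner,
    ]


def tally_upload_output(output):
    """Count 'Created subfolder' and 'Uploaded document' lines in the output."""
    tally = {"subfolders": 0, "documents": 0}
    for line in output.splitlines():
        line = line.strip()
        if line.startswith("Created subfolder:"):
            tally["subfolders"] += 1
        elif line.startswith("Uploaded document:"):
            tally["documents"] += 1
    return tally


command = build_upload_command(
    "tmp/Term8", "/ROOT/Projects/University Documents/", "testuser1"
)
# shlex.join quotes the path containing a space, as you would in a shell.
print(shlex.join(command))
```

Passing the list to `subprocess.run` (rather than a single string) avoids quoting problems with folder names that contain spaces, such as `University Documents`.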
Verifying the Upload and Indexing Documents
Once the upload completes, you can check the **folder view in DocrepoX**, where you should see the same file structure and documents that were just uploaded.
However, keep in mind that the **bulk upload tool only transfers documents**—it doesn’t automatically make them searchable or previewable. For that, they need to be properly indexed.
Indexing Your Documents
To ensure your documents are fully integrated into DocrepoX’s search and preview system, you’ll need to run the **indexing command**:
python manage.py run_indexing
This processes the newly uploaded files, making them accessible through search and enabling previews where supported. Running this step after bulk uploads ensures your documents are fully ready to use within DocrepoX.
If you have debug turned on, you should see output like the following:
DEBUG 2025-03-15 12:52:18,865 indexing.reconcile_missing_indexes: Reconciling missing indexes ...
DEBUG 2025-03-15 12:52:18,872 indexing.index_document: Getting preview for document: /ROOT/Projects/University Documents/Software Engineering/We…/LearningJournal3.docx|content/2025/3/15/12/48/53d4b4e9-d737-4993-9298-f60ebd113af4|1.0
DEBUG 2025-03-15 12:52:18,872 indexing.index_document: Preview does not exist for document: /ROOT/Projects/University Documents/Software Engineering/We…/LearningJournal3.docx|content/2025/3/15/12/48/53d4b4e9-d737-4993-9298-f60ebd113af4|1.0. Creating one.
DEBUG 2025-03-15 12:52:18,873 core.generate_pdf_file: Checking for SOFFICE_EXE install ...
DEBUG 2025-03-15 12:52:18,873 core.generate_pdf_file: /usr/bin/soffice found.
DEBUG 2025-03-15 12:52:18,874 core.generate_pdf_file: File to be used for PDF generation: content/2025/3/15/12/48/53d4b4e9-d737-4993-9298-f60ebd113af4
DEBUG 2025-03-15 12:52:18,874 core.generate_pdf_file: Document name is: LearningJournal3.docx
DEBUG 2025-03-15 12:52:18,877 core.generate_pdf_file: Logical path: /ROOT/Projects/University Documents/Software Engineering/We…/LearningJournal3.docx
DEBUG 2025-03-15 12:52:18,877 core.generate_pdf_file: File extension is: .docx
DEBUG 2025-03-15 12:52:18,877 core.generate_pdf_file: File size is: 7054
DEBUG 2025-03-15 12:52:18,877 core.generate_pdf_file: SOFFICE path is set to: /usr/bin/soffice
DEBUG 2025-03-15 12:52:18,877 core.generate_pdf_file: File: content/2025/3/15/12/48/53d4b4e9-d737-4993-9298-f60ebd113af4 has an allowed extension type: .docx. Preview transform will be attempted.
DEBUG 2025-03-15 12:52:18,878 core.generate_pdf_file: Using command for transform: /usr/bin/soffice --headless --convert-to pdf --outdir /usr/src/docrepo/mediafiles/content/tmp /usr/src/docrepo/mediafiles/content/2025/3/15/12/48/53d4b4e9-d737-4993-9298-f60ebd113af4
convert /usr/src/docrepo/mediafiles/content/2025/3/15/12/48/53d4b4e9-d737-4993-9298-f60ebd113af4 as a Writer document -> /usr/src/docrepo/mediafiles/content/tmp/53d4b4e9-d737-4993-9298-f60ebd113af4.pdf using filter : writer_pdf_Export
DEBUG 2025-03-15 12:52:19,923 core.generate_pdf_file: Temp file for upload is /usr/src/docrepo/mediafiles/content/tmp/53d4b4e9-d737-4993-9298-f60ebd113af4.pdf
DEBUG 2025-03-15 12:52:19,923 core.generate_preview_file: Generating preview file from /usr/src/docrepo/mediafiles/content/tmp/53d4b4e9-d737-4993-9298-f60ebd113af4.pdf
DEBUG 2025-03-15 12:52:19,923 core.generate_preview_file: Attempting to save preview content file: d5379645-04a9-483b-8e93-0274f5fc40a1
DEBUG 2025-03-15 12:52:19,925 core.generate_preview_file: Preview content file saved.
DEBUG 2025-03-15 12:52:19,925 core.generate_preview_file: Preview file creation successful. Removing temp file: /usr/src/docrepo/mediafiles/content/tmp/53d4b4e9-d737-4993-9298-f60ebd113af4.pdf
DEBUG 2025-03-15 12:52:19,925 core.generate_preview_file: Removing tmp_file: /usr/src/docrepo/mediafiles/content/tmp/53d4b4e9-d737-4993-9298-f60ebd113af4.pdf
DEBUG 2025-03-15 12:52:19,926 indexing.index_document: Extracting text from: /usr/src/docrepo/mediafiles/content/2025/3/15/12/52/fd231e10-2679-4849-9591-a1d31ec15e71
DEBUG 2025-03-15 12:52:19,931 indexing.index_document: Updating index for document: /ROOT/Projects/University Documents/Software Engineering/We…/LearningJournal3.docx
DEBUG 2025-03-15 12:52:19,932 indexing.index_document: Indexed /usr/src/docrepo/mediafiles/content/2025/3/15/12/52/fd231e10-2679-4849-9591-a1d31ec15e71 successfully
As you can see, DocrepoX reconciles any missing indexes, creates a preview (as a PDF file so that it can be displayed in your browser), then extracts the text from the PDF and indexes it for search.
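The timestamps in the debug output also show where the time goes. Here is a rough sketch for finding the longest pause between two consecutive log lines. It assumes the line layout seen above (level, timestamp with milliseconds, logger name, message); `parse_entries` and `slowest_step` are names I made up for the example, and lines without a timestamp, such as the `convert ...` line, are skipped.

```python
import re
from datetime import datetime

# Matches lines like:
# DEBUG 2025-03-15 12:52:18,877 core.generate_pdf_file: File size is: 7054
LOG_LINE = re.compile(
    r"^(?P<level>[A-Z]+) "
    r"(?P<stamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<logger>[\w.]+): (?P<message>.*)$"
)


def parse_entries(lines):
    """Return (timestamp, logger, message) for every line that matches."""
    entries = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # not a timestamped log line
        stamp = datetime.strptime(match["stamp"], "%Y-%m-%d %H:%M:%S,%f")
        entries.append((stamp, match["logger"], match["message"]))
    return entries


def slowest_step(lines):
    """Return (seconds, logger, message) for the entry followed by the longest pause."""
    entries = parse_entries(lines)
    if len(entries) < 2:
        return None
    best = None
    for (start, logger, message), (end, _, _) in zip(entries, entries[1:]):
        gap = (end - start).total_seconds()
        if best is None or gap > best[0]:
            best = (gap, logger, message)
    return best
```

For the log above, the longest pause is the roughly one second between the `Using command for transform` line (12:52:18,878) and the next entry (12:52:19,923), which is the PDF conversion. Every other step completes within a few milliseconds.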
If you regularly handle large volumes of documents, the entire workflow, uploading and indexing, can be automated. A scheduled script can ingest new or modified documents without manual intervention.
For example, you can use a tool like rsync to sync files from a source server to your DocrepoX installation, then run the bulk upload and indexing commands whenever the sync brings in changes. This keeps your document repository up to date, fully indexed, and ready for use, whether you're managing business records, research archives, or university coursework.
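To make that concrete, here is a minimal sketch of such a script, with several assumptions. It is meant to be run from the docrepo folder (for instance from cron), the function names are hypothetical, and it treats any line printed by `rsync --itemize-changes` as a sign that something changed. How the upload command handles files that already exist in the repository is not covered in this post, so test that on a small folder before scheduling anything.

```python
import subprocess


def sync_changed(source, staging, runner=subprocess.run):
    """Run rsync and return True if it reported at least one changed item."""
    result = runner(
        # -a preserves the folder structure; --itemize-changes prints one
        # line per item that was created or updated, and nothing otherwise.
        ["rsync", "-a", "--itemize-changes", source, staging],
        capture_output=True, text=True, check=True,
    )
    return bool(result.stdout.strip())


def ingest(staging, target, owner, runner=subprocess.run):
    """Run the bulk upload, then the indexing command, stopping on failure."""
    runner(
        ["python", "manage.py", "upload", staging, target, "--owner", owner],
        check=True,
    )
    runner(["python", "manage.py", "run_indexing"], check=True)


def sync_and_ingest(source, staging, target, owner, runner=subprocess.run):
    """Sync, and only upload and index when the sync brought in changes."""
    if not sync_changed(source, staging, runner=runner):
        return False
    ingest(staging, target, owner, runner=runner)
    return True
```

The `runner` parameter exists only so the flow can be tried with a stand-in function before pointing it at a real server. In normal use you would leave it at its default, `subprocess.run`.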
With DocrepoX, handling document ingestion at scale is not just possible—it’s streamlined, efficient, and fully adaptable to your workflow.